fix: check for existing buildkitd before mounting sticky disk #71
When setup-docker-builder is invoked twice in the same job (e.g. via a composite action called twice), the second invocation was calling `setupStickyDisk()` before detecting the already-running buildkitd. This caused a new sticky disk to be mounted on top of /var/lib/buildkit while buildkitd was still running with in-memory metadata referencing snapshot directories from the original disk. The subsequent build then failed with:

```
ERROR: failed to solve: failed to read dockerfile: failed to walk: resolve: lstat /var/lib/buildkit/runc-overlayfs/snapshots/snapshots/N: no such file or directory
```

Fix: move the buildkitd process check to the very beginning of `startBlacksmithBuilder()`, before any sticky disk setup. If buildkitd is already running, log an informational message and return immediately so the fallback path reuses the existing configured builder (from the first invocation) without corrupting its overlayfs snapshot state.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
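The reordered check can be sketched roughly like this. This is a minimal illustration, not the action's actual code: `isBuildkitdRunning`, the `pgrep` probe, and the return shape are assumptions, and the real function is async and takes more parameters.

```typescript
import { execSync } from "node:child_process";

// Hypothetical sketch: probe for a running buildkitd before any sticky disk work.
function isBuildkitdRunning(): boolean {
  try {
    // pgrep exits non-zero when no process matches, which throws here.
    execSync("pgrep -x buildkitd", { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

function startBlacksmithBuilder(): { addr: string | null } {
  // The fix: this guard now runs before setupStickyDisk(), so a second
  // invocation never remounts a sticky disk under a live buildkitd.
  if (isBuildkitdRunning()) {
    console.info("buildkitd already running; reusing existing builder");
    return { addr: null };
  }
  // ...first invocation: mount sticky disk, start buildkitd, return its address.
  return { addr: "unix:///run/buildkit/buildkitd.sock" };
}
```

Because the early-exit path returns `addr: null`, a second invocation never records a builder address of its own, which is what the cleanup fix below relies on.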
Why?
====

When setup-docker-builder is called twice in one job, the second instance's post-action cleanup kills the buildkitd process that the first instance started. GitHub Actions runs post steps in reverse order, so the second cleanup runs first and tears down buildkitd before the first instance's builds can export their build records. This causes "connection refused" errors in the first build's post step and "buildkitd has crashed" warnings in the first setup's cleanup.

How?
====

- Inspected the failing job logs from the ee-monorepo CI run to trace the exact sequence: Post Setup (2nd) kills buildkitd, then Post Build (1st) gets connection refused.
- Confirmed that `startBlacksmithBuilder` returns `addr: null` when it detects existing buildkitd, which means `buildkitdAddr` is never saved to state for the second instance.
- Used `buildkitdAddr` presence in state as the ownership signal: only the instance that started buildkitd should shut it down.
- Extracted the shutdown logic into `maybeShutdownBuildkitd()` with early returns to flatten the deeply nested if/try/catch structure, and deduplicated the crash-log printing into `logBuildkitdCrashLogs()`.
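The ownership check described above can be sketched like this. This is a hypothetical, simplified shape: the real post step reads state via `@actions/core` and actually signals the process, and the state accessor is injected here only to keep the sketch self-contained.

```typescript
// Hypothetical sketch: decide whether this instance owns buildkitd.
// Returns true if a shutdown was performed, false if it was skipped.
function maybeShutdownBuildkitd(getState: (key: string) => string): boolean {
  // Only the instance that started buildkitd saved buildkitdAddr to state;
  // the early-exit path returned addr: null and saved nothing.
  const addr = getState("buildkitdAddr");
  if (!addr) {
    console.info("this instance did not start buildkitd; skipping shutdown");
    return false;
  }
  // ...here the real cleanup would SIGTERM the buildkitd it started, wait
  // for it to exit, and call logBuildkitdCrashLogs() if it had already died.
  console.info(`shutting down buildkitd at ${addr}`);
  return true;
}
```

With post steps running in reverse order, the second instance hits the early return and leaves buildkitd alive, and only the first instance (the owner) tears it down.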
When you rebuild I will test this again!
@chadxz - sorry for the delay; you should be good to go here!
Tested this against our monorepo where we hit the original bug: we have composite actions that call the setup action, so a single workflow ends up calling it twice. The logs show the fix working as expected: the first invocation starts buildkitd, and the second invocation detects it and bails out early. Cleanup also looked correct: the second post-step skipped shutdown since it did not start buildkitd, and the first post-step shut it down gracefully. Full job logs: https://app.blacksmith.sh/convergint/runs/23514311534/jobs/68442899155
Based on #65 by @chadxz, with dist rebuilt.
Note
Medium Risk
Changes builder initialization control flow to early-exit when `buildkitd` is already running, which affects how repeated action invocations behave and could impact workflows that relied on re-initialization side effects. Scope is localized to setup sequencing and logging, with no new external interfaces.

Overview

Prevents repeated `setup-docker-builder` invocations in the same job from remounting the sticky disk over an already-running `buildkitd`. `startBlacksmithBuilder()` now checks for an existing `buildkitd` process before calling `setupStickyDisk()`; if found, it logs and returns early so the action reuses the already-configured builder instead of corrupting BuildKit's on-disk snapshot state.

Written by Cursor Bugbot for commit 730dff6.