fix: prevent indefinite hangs from stale S3 datastore locks #874
Conversation
The `acquire()` loop in both `S3Lock` and `FileLock` had a control flow bug where the `continue` after stale lock detection bypassed the `maxWaitMs` timeout check, creating an infinite loop when the stale lock could not be deleted (e.g. due to S3 versioning interactions or persistent delete failures). Additionally, the global lock polling loops in `acquireModelLocks()` had no timeout at all.

Moves the timeout check to the top of the `acquire()` loop so it fires on every iteration. Adds a 2x TTL timeout to the global lock polling loops in `acquireModelLocks()`. Adds warn-level logging to `release()` when lock deletion fails.

Co-authored-by: Kief Morris <kief@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
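A minimal sketch of the control-flow fix, with hypothetical types (`LockOps`, the injected clock) standing in for the real `S3Lock`/`FileLock` internals — this models only the loop shape, not the actual API:

```typescript
// Hypothetical model of the acquire() loop after the fix. The timeout
// check runs at the TOP of the loop, so a `continue` from the stale-lock
// cleanup path can never bypass it.
type Clock = () => number;

interface LockOps {
  tryAcquire(): boolean; // conditional write; false if lock already held
  isStale(): boolean;    // TTL expired on the existing lock
  deleteStale(): void;   // may throw (e.g. persistent S3 delete failure)
}

function acquire(ops: LockOps, maxWaitMs: number, now: Clock): void {
  const start = now();
  while (true) {
    // Timeout check first: fires on every iteration, including right
    // after `continue` from the stale-lock branch below.
    if (now() - start >= maxWaitMs) {
      throw new Error(`LockTimeoutError: gave up after ${maxWaitMs}ms`);
    }
    if (ops.tryAcquire()) return;
    if (ops.isStale()) {
      try {
        ops.deleteStale();
      } catch {
        // Before the fix, `continue` jumped past a bottom-of-loop
        // timeout check, looping forever when deletion kept failing.
      }
      continue; // retry immediately; the top-of-loop check still bounds us
    }
    // (the real code sleeps retryIntervalMs here before retrying)
  }
}
```

With the check at the top, the worst case for a stale lock that can never be deleted is a bounded `maxWaitMs` of retries followed by a timeout error, rather than an indefinite hang.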
Adversarial Review
Critical / High
None found.
Medium
- **Tight loop without backoff when stale lock cleanup repeatedly fails** — `src/infrastructure/persistence/s3_lock.ts:128-133` and `src/infrastructure/persistence/file_lock.ts:132-137`. When a stale lock is detected and `deleteObject`/`Deno.remove` throws, the `continue` goes back to the top of the loop. The path is: timeout check → `putObjectConditional` (fails) → `readLock` → `headObject` (stale) → `deleteObject` (throws) → `continue`. The `retryIntervalMs` sleep at line 139/142 is never reached on this path. With default settings (`maxWaitMs: 60_000`), this means up to 60 seconds of tight async calls hammering the S3 API (4 calls per iteration) with no backoff. In CI with parallel processes hitting the same bucket, this could generate substantial unnecessary API traffic and cost. This is pre-existing behavior — the PR actually improves the situation by guaranteeing the loop terminates. But now that the loop is guaranteed to run for the full `maxWaitMs` duration in this scenario (instead of hanging forever), the tight polling is more likely to be exercised in practice. Example: S3 bucket with versioning enabled, stale lock, `deleteObject` creates delete markers but conditional puts keep failing. 60 seconds × ~100 iterations/sec × 4 calls/iteration ≈ 24,000 S3 API calls. Suggested improvement (not blocking): add a sleep before the `continue` in the stale-lock-delete-failure catch block, or fall through to the normal sleep at the bottom instead of `continue`-ing.
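That suggestion could look roughly like the following sketch — hypothetical helper names, not the actual `s3_lock.ts` structure — where a sleep runs on every failed iteration, including the stale-lock-delete-failure path, with capped exponential growth:

```typescript
// Sketch of the suggested backoff. `attempt` stands for one iteration of
// the acquire loop (conditional put + stale-lock handling); it returns
// true when the lock was acquired. All names here are hypothetical.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function retryWithBackoff(
  attempt: () => Promise<boolean>,
  retryIntervalMs: number,
  maxWaitMs: number,
): Promise<void> {
  const start = Date.now();
  let delay = retryIntervalMs;
  while (Date.now() - start < maxWaitMs) {
    if (await attempt()) return;
    await sleep(delay);                 // backoff even on the failure path
    delay = Math.min(delay * 2, 2_000); // cap growth so waits stay bounded
  }
  throw new Error("LockTimeoutError");
}
```

With a ~100 ms starting interval capped at 2 s, a 60-second window produces on the order of dozens of iterations instead of thousands, while the top-of-loop timeout check still bounds total wait time.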
Low
- **Recursive `acquireModelLocks` has no depth limit** — `src/cli/repo_context.ts:699`. If the global lock flaps (appears right after each per-model lock acquisition, then disappears during polling), the function recurses indefinitely. The polling timeout limits the time per recursion, but not the recursion depth. In practice, stack overflow is extremely unlikely because each recursion involves substantial async work and the scenario requires adversarial lock timing. Not blocking, but worth noting.
- **`globalInfo.ttlMs ?? 30_000` with `ttlMs: 0`** — `src/cli/repo_context.ts:592`. `LockInfo.ttlMs` is a required field, so the `??` fallback is defensive (fine). However, if a lock were ever written with `ttlMs: 0`, then `globalMaxWaitMs = 0` and the timeout would fire on the very first check, throwing `LockTimeoutError` immediately even if the lock is about to be released. Unlikely in practice since `ttlMs: 0` would mean the lock is always stale.
Verdict
PASS — This is a well-targeted fix for a real infinite-hang bug. The timeout-at-top-of-loop approach correctly ensures the timeout check cannot be bypassed by continue jumps from stale lock cleanup. The repo_context.ts polling loops now have bounded wait times. Release logging is correctly differentiated between backends (S3 DELETE is idempotent so no NotFound filtering needed; file remove correctly filters NotFound). New tests exercise both identified failure modes (persistent delete failure, and delete-succeeds-but-lock-reappears). The changes are minimal, focused, and well-tested.
Code Review
Clean, well-targeted bug fix. The root cause analysis is thorough and the fix is structurally sound.
Blocking Issues
None.
Suggestions
- **FileLock stale-lock timeout tests**: The new S3Lock tests (`times out when stale lock cannot be deleted`, `timeout check fires even during stale lock retry loop`) validate the exact bug being fixed. FileLock received the same structural fix (timeout check moved to top of loop) but no equivalent test was added. The existing `stale lock is force-acquired` test covers the happy path but not the failure-to-delete edge case. Consider adding a parallel test for FileLock (e.g., make the lockfile undeletable via permissions) for completeness — not blocking, since the logic is symmetric and the S3 tests validate the pattern.
- **Logger instantiation in error path**: In both `S3Lock.release()` and `FileLock.release()`, `getSwampLogger()` is called inside the `catch` block on every error. This is fine for an error path (infrequent), but if `getSwampLogger` is expensive, you could hoist it to a class field. Very minor — not worth changing unless profiling shows an issue.
Adversarial Review
Critical / High
None found.
Medium
- `src/cli/repo_context.ts:592` — `globalInfo.ttlMs` used without validation for timeout calculation. If a corrupted or adversarial lock file contains `ttlMs: Infinity` or `ttlMs: Number.MAX_SAFE_INTEGER`, then `globalMaxWaitMs` becomes `Infinity` and the timeout never fires, restoring the original hang. In practice this requires a corrupted lock file in S3/filesystem, so the blast radius is small — the lock's own `acquire()` timeout (which uses a locally-configured `maxWaitMs`) would still fire when re-acquiring. But the global polling loops in `acquireModelLocks` trust the remote lock's `ttlMs` entirely. Consider clamping: `Math.min((globalInfo.ttlMs ?? 30_000) * 2, 120_000)`.
- `src/infrastructure/persistence/s3_lock.ts:103` — extra S3 call on timeout path. When `elapsed >= maxWaitMs`, `readLock()` makes a `getObject` call just to populate the error message. If S3 is slow/unreachable (a likely scenario when locks are stuck), this adds latency to the error path. Not a correctness issue since `readLock` catches errors and returns `null`, but it delays surfacing the timeout to the caller. The old code used the already-fetched `existing` variable. This is a minor regression in error-path performance.
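The suggested clamp could be factored into a small helper — a sketch, assuming the `ttlMs` value read from the lock file may be absent or corrupted (the real code computes `globalMaxWaitMs` inline in `repo_context.ts`; the helper name is hypothetical):

```typescript
// Hypothetical helper: derive a bounded polling timeout from a ttlMs
// value that came from an untrusted lock file. A non-finite or
// non-positive ttlMs falls back to the default; the result is capped so
// Infinity / MAX_SAFE_INTEGER can't disable the timeout.
function globalPollTimeoutMs(ttlMs: number | undefined): number {
  const DEFAULT_TTL_MS = 30_000;
  const MAX_WAIT_MS = 120_000;
  const ttl = typeof ttlMs === "number" && Number.isFinite(ttlMs) && ttlMs > 0
    ? ttlMs
    : DEFAULT_TTL_MS;
  return Math.min(ttl * 2, MAX_WAIT_MS);
}
```

This keeps the 2× TTL behavior for healthy lock files while guaranteeing the polling loop always terminates within an operator-chosen upper bound.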
Low
- `src/infrastructure/persistence/s3_lock_test.ts:267,310` — the `headObject` override uses the `mock as unknown as Record<string, unknown>` pattern. This bypasses type checking on the mock, so if `headObject`'s signature changes, these tests won't catch the mismatch at compile time. Not a bug today, but fragile. It is an existing pattern in the codebase, though, so not a regression.
Verdict
PASS — The core fix is correct and well-tested. The timeout check relocation properly closes the infinite-loop bug on the stale-lock continue path. The global polling timeouts are a good hardening measure. The two medium findings are edge cases that don't block.
Summary
Fixes #872 — S3 datastore locks cause indefinite hangs when stale lock files exist in S3.
Root cause: The `acquire()` loop in both `S3Lock` and `FileLock` had a control flow bug where the `continue` after stale lock detection bypassed the `maxWaitMs` timeout check at the bottom of the loop. If the stale lock couldn't be cleaned up (e.g. due to S3 versioning interactions where `DeleteObject` creates a delete marker but conditional writes still fail, or persistent delete failures), the loop became infinite — the 60-second timeout never fired because the code never reached it. Additionally, two global lock polling loops in `acquireModelLocks()` had no timeout at all.
What changed
- `S3Lock.acquire()` / `FileLock.acquire()` — Moved the `maxWaitMs` timeout check from the bottom of the loop to the top, so it fires on every iteration, including after `continue` from stale lock cleanup. When `deleteObject` succeeds, the immediate retry behavior is preserved (no unnecessary sleep).
- `acquireModelLocks()` in `repo_context.ts` — Added a timeout of 2× the lock's TTL (default: 60s) to both global lock polling loops. These previously had `while (true)` with no exit condition other than the lock disappearing or staleness detection succeeding.
- `S3Lock.release()` / `FileLock.release()` — Added warn-level logging when lock deletion fails during release. Previously all errors were silently swallowed, making it invisible why locks were accumulating.
- New tests — Two new test cases in `s3_lock_test.ts`: `deleteObject` always throws → verifies `LockTimeoutError` fires
Trade-offs
- With the check at the top, the first iteration evaluates the timeout when `elapsed` is ~0ms, so there's one extra (negligible) `Date.now()` call per acquisition. No functional impact.
- The global polling loops can now throw `LockTimeoutError` instead of waiting. This seems preferable to hanging indefinitely.
- The release logging adds a logging import to `S3Lock` and `FileLock`. These are infrastructure modules that already depend on the domain layer, so the new import to the logging module is consistent with the existing dependency direction.
User impact
Before this fix, a stale lock made `model method run` in CI → unusable after 3-6 operations, and cleanup required a manual `aws s3 rm` of lock files. Now stuck acquisitions fail fast with a `LockTimeoutError` with holder info, and `swamp datastore lock release --force` is available for cleanup.
No changes to happy-path behavior. Lock acquisition, heartbeat, and release all work identically when things go right.
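The release-path logging change described under "What changed" might be shaped roughly like this sketch, with the delete operation and logger abstracted as parameters (hypothetical signatures — the actual code calls `getSwampLogger()` inside the `catch`):

```typescript
// Hypothetical shape of the release() change: deletion failures are
// logged at warn level instead of being silently swallowed, so operators
// can see why lock files accumulate. release() still never throws.
async function release(
  deleteLock: () => Promise<void>, // e.g. S3 DeleteObject or Deno.remove
  warn: (msg: string) => void,     // warn-level logger
): Promise<void> {
  try {
    await deleteLock();
  } catch (err) {
    // Previously: all errors swallowed here with no trace.
    warn(`lock release failed: ${String(err)}`);
  }
}
```

Keeping release() non-throwing preserves the old contract for callers while surfacing the failure in logs.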
Test plan
- `S3Lock - times out when stale lock cannot be deleted`
- `S3Lock - timeout check fires even during stale lock retry loop`
- `deno check`, `deno lint`, `deno fmt` clean
- `deno run compile` succeeds

🤖 Generated with Claude Code