fix: prevent indefinite hangs from stale S3 datastore locks#874

Merged
stack72 merged 1 commit into main from fix/s3-stale-lock-timeout-872
Mar 26, 2026

Conversation


@stack72 stack72 commented Mar 26, 2026

Summary

Fixes #872 — S3 datastore locks cause indefinite hangs when stale lock files exist in S3.

Root cause: The acquire() loop in both S3Lock and FileLock had a control flow bug where the continue after stale lock detection bypassed the maxWaitMs timeout check at the bottom of the loop. If the stale lock couldn't be cleaned up (e.g. due to S3 versioning interactions where DeleteObject creates a delete marker but conditional writes still fail, or persistent delete failures), the loop became infinite — the 60-second timeout never fired because the code never reached it. Additionally, two global lock polling loops in acquireModelLocks() had no timeout at all.

What changed

  • S3Lock.acquire() / FileLock.acquire() — Moved the maxWaitMs timeout check from the bottom of the loop to the top, so it fires on every iteration including after continue from stale lock cleanup. When deleteObject succeeds, the immediate retry behavior is preserved (no unnecessary sleep).

  • acquireModelLocks() in repo_context.ts — Added a timeout of 2× the lock's TTL (default: 60s) to both global lock polling loops. These previously had while(true) with no exit condition other than the lock disappearing or staleness detection succeeding.

  • S3Lock.release() / FileLock.release() — Added warn-level logging when lock deletion fails during release. Previously all errors were silently swallowed, making it invisible why locks were accumulating.

  • New tests — Two new test cases in s3_lock_test.ts:

    • Stale lock where deleteObject always throws → verifies LockTimeoutError fires
    • Stale lock where delete "succeeds" but lock reappears (S3 versioning scenario) → verifies timeout fires instead of infinite loop
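The fixed control flow above can be sketched as follows. This is a hypothetical skeleton: tryPutLock, readLock, deleteLock, and sleep are illustrative stand-ins for the real S3Lock/FileLock operations, which are not shown in this PR.

```typescript
// Sketch of the fixed acquire() control flow. The operation names below
// (tryPutLock, readLock, deleteLock, sleep) are illustrative stand-ins,
// not the actual S3Lock/FileLock API.
class LockTimeoutError extends Error {}

interface LockOps {
  tryPutLock(): Promise<boolean>; // conditional write; false if the lock exists
  readLock(): Promise<{ stale: boolean } | null>;
  deleteLock(): Promise<void>; // may throw (e.g. S3 versioning interactions)
  sleep(ms: number): Promise<void>;
}

async function acquire(
  ops: LockOps,
  maxWaitMs = 60_000,
  retryIntervalMs = 500,
): Promise<void> {
  const start = Date.now();
  while (true) {
    // Timeout check at the TOP of the loop: it now runs on every iteration,
    // including right after the `continue` from stale-lock cleanup below.
    if (Date.now() - start >= maxWaitMs) {
      throw new LockTimeoutError(`lock not acquired within ${maxWaitMs}ms`);
    }
    if (await ops.tryPutLock()) return; // acquired
    const existing = await ops.readLock();
    if (existing?.stale) {
      try {
        await ops.deleteLock();
      } catch {
        // Cleanup failed; the top-of-loop check still bounds the loop.
      }
      continue; // immediate retry, no unnecessary sleep after cleanup
    }
    await ops.sleep(retryIntervalMs);
  }
}
```

In the old shape, the timeout check sat below the sleep at the bottom, so the `continue` in the stale branch jumped over it entirely.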

Trade-offs

  • Moving the timeout check from the bottom of the loop to the top means elapsed is ~0ms on the very first iteration, so there's one extra (negligible) Date.now() call per acquisition. No functional impact.
  • The global lock polling timeout (2× TTL = 60s) is a balance: long enough to accommodate legitimate structural commands, short enough to recover from stale locks. If a structural command legitimately holds the global lock for >60s, per-model operations will now throw LockTimeoutError instead of waiting. This seems preferable to hanging indefinitely.
  • Lock release logging adds a logger dependency to S3Lock and FileLock. These are infrastructure modules that already depend on the domain layer, so the new import to the logging module is consistent with the existing dependency direction.
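A minimal sketch of the bounded global-lock polling described above, assuming hypothetical readGlobalLock and sleep helpers (the real acquireModelLocks() internals are not shown in this PR):

```typescript
// Sketch of the bounded global-lock polling loop; readGlobalLock and sleep
// are hypothetical stand-ins for the real acquireModelLocks() internals.
class LockTimeoutError extends Error {}

async function waitForGlobalLockRelease(
  readGlobalLock: () => Promise<{ ttlMs: number } | null>,
  sleep: (ms: number) => Promise<void>,
  pollIntervalMs = 250,
): Promise<void> {
  const held = await readGlobalLock();
  if (held === null) return; // no global lock held
  const maxWaitMs = (held.ttlMs ?? 30_000) * 2; // 2x TTL, 60s by default
  const start = Date.now();
  while (true) {
    if (Date.now() - start >= maxWaitMs) {
      // Previously: while (true) with no exit condition other than the lock
      // disappearing, so a stale undeletable lock meant an infinite hang.
      throw new LockTimeoutError(`global lock still held after ${maxWaitMs}ms`);
    }
    if ((await readGlobalLock()) === null) return; // lock released
    await sleep(pollIntervalMs);
  }
}
```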

User impact

| Before | After |
| --- | --- |
| Stale S3 lock → indefinite hang, CI job stuck forever | Recovers within 60s or throws a clear `LockTimeoutError` with holder info |
| Sequential model methods run in CI → unusable after 3-6 operations | Works reliably |
| Lock release failures → silent, no visibility | Warn-level log message with error details |
| Workaround: manual `aws s3 rm` of lock files | No workaround needed; `swamp datastore lock release --force` also available |

No changes to happy-path behavior. Lock acquisition, heartbeat, and release all work identically when things go right.

Test plan

  • New unit test: S3Lock - times out when stale lock cannot be deleted
  • New unit test: S3Lock - timeout check fires even during stale lock retry loop
  • All existing S3Lock tests pass (13/13)
  • Full test suite passes (3559/3559)
  • deno check, deno lint, deno fmt clean
  • deno run compile succeeds

🤖 Generated with Claude Code

The acquire() loop in both S3Lock and FileLock had a control flow bug
where the `continue` after stale lock detection bypassed the maxWaitMs
timeout check, creating an infinite loop when the stale lock could not
be deleted (e.g. due to S3 versioning interactions or persistent delete
failures). Additionally, the global lock polling loops in
acquireModelLocks() had no timeout at all.

Moves the timeout check to the top of the acquire() loop so it fires
on every iteration. Adds a 2x TTL timeout to the global lock polling
loops in acquireModelLocks(). Adds warn-level logging to release() when
lock deletion fails.

Co-authored-by: Kief Morris <kief@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@github-actions github-actions bot left a comment


Adversarial Review

Critical / High

None found.

Medium

  1. Tight loop without backoff when stale lock cleanup repeatedly fails — src/infrastructure/persistence/s3_lock.ts:128-133 and src/infrastructure/persistence/file_lock.ts:132-137

    When a stale lock is detected and deleteObject/Deno.remove throws, the continue goes back to the top of the loop. The path is: timeout check → putObjectConditional (fails) → readLock/headObject (stale) → deleteObject (throws) → continue. The retryIntervalMs sleep at line 139/142 is never reached on this path.

    With default settings (maxWaitMs: 60_000), this means up to 60 seconds of tight async calls hammering the S3 API (4 calls per iteration) with no backoff. In CI with parallel processes hitting the same bucket, this could generate substantial unnecessary API traffic and cost.

    This is pre-existing behavior — the PR actually improves the situation by guaranteeing the loop terminates. But now that the loop is guaranteed to run for the full maxWaitMs duration in this scenario (instead of hanging forever), the tight polling is more likely to be exercised in practice.

    Example: S3 bucket with versioning enabled, stale lock, deleteObject creates delete markers but conditional puts keep failing. 60 seconds × ~100 iterations/sec = ~24,000 S3 API calls.

    Suggested improvement (not blocking): Add a sleep before the continue in the stale-lock-delete-failure catch block, or fall through to the normal sleep at the bottom of the loop instead of continuing.
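    The suggested change could be factored as a small helper along these lines (cleanupStaleLock and its parameters are hypothetical names; the real catch block lives inside acquire()):

```typescript
// Sketch of the suggested backoff: when deleting a stale lock fails, sleep
// before retrying instead of spinning straight back to the top of the loop.
// cleanupStaleLock and its parameter names are hypothetical.
async function cleanupStaleLock(
  deleteLock: () => Promise<void>,
  sleep: (ms: number) => Promise<void>,
  retryIntervalMs: number,
): Promise<boolean> {
  try {
    await deleteLock();
    return true; // deleted: caller may retry the conditional put immediately
  } catch {
    await sleep(retryIntervalMs); // back off instead of hammering the S3 API
    return false;
  }
}
```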

Low

  1. Recursive acquireModelLocks has no depth limit — src/cli/repo_context.ts:699

    If the global lock flaps (appears right after each per-model lock acquisition, then disappears during polling), the function recurses indefinitely. The polling timeout limits the time per recursion, but not the recursion depth. In practice, stack overflow is extremely unlikely because each recursion involves substantial async work and the scenario requires adversarial lock timing. Not blocking, but worth noting.

  2. globalInfo.ttlMs ?? 30_000 with ttlMs: 0 — src/cli/repo_context.ts:592

    LockInfo.ttlMs is a required field so the ?? fallback is defensive (fine). However, if a lock were ever written with ttlMs: 0, then globalMaxWaitMs = 0 and the timeout would fire on the very first check, throwing LockTimeoutError immediately even if the lock is about to be released. Unlikely in practice since ttlMs: 0 would mean the lock is always stale.

Verdict

PASS — This is a well-targeted fix for a real infinite-hang bug. The timeout-at-top-of-loop approach correctly ensures the timeout check cannot be bypassed by continue jumps from stale lock cleanup. The repo_context.ts polling loops now have bounded wait times. Release logging is correctly differentiated between backends (S3 DELETE is idempotent so no NotFound filtering needed; file remove correctly filters NotFound). New tests exercise both identified failure modes (persistent delete failure, and delete-succeeds-but-lock-reappears). The changes are minimal, focused, and well-tested.


@github-actions github-actions bot left a comment


Code Review

Clean, well-targeted bug fix. The root cause analysis is thorough and the fix is structurally sound.

Blocking Issues

None.

Suggestions

  1. FileLock stale-lock timeout tests: The new S3Lock tests (times out when stale lock cannot be deleted, timeout check fires even during stale lock retry loop) validate the exact bug being fixed. FileLock received the same structural fix (timeout check moved to top of loop) but no equivalent test was added. The existing stale lock is force-acquired test covers the happy path but not the failure-to-delete edge case. Consider adding a parallel test for FileLock (e.g., make the lockfile undeletable via permissions) for completeness — not blocking since the logic is symmetric and the S3 tests validate the pattern.

  2. Logger instantiation in error path: In both S3Lock.release() and FileLock.release(), getSwampLogger() is called inside the catch block on every error. This is fine for an error path (infrequent), but if getSwampLogger is expensive, you could hoist it to a class field. Very minor — not worth changing unless profiling shows an issue.


@github-actions github-actions bot left a comment


Adversarial Review

Critical / High

None found.

Medium

  1. src/cli/repo_context.ts:592 — globalInfo.ttlMs used without validation for timeout calculation. If a corrupted or adversarial lock file contains ttlMs: Infinity or ttlMs: Number.MAX_SAFE_INTEGER, then globalMaxWaitMs becomes Infinity and the timeout never fires, restoring the original hang. In practice this requires a corrupted lock file in S3/filesystem, so the blast radius is small — the lock's own acquire() timeout (which uses a locally-configured maxWaitMs) would still fire when re-acquiring. But the global polling loops in acquireModelLocks trust the remote lock's ttlMs entirely. Consider clamping: Math.min((globalInfo.ttlMs ?? 30_000) * 2, 120_000).

  2. src/infrastructure/persistence/s3_lock.ts:103 — extra S3 call on timeout path. When elapsed >= maxWaitMs, readLock() makes a getObject call just to populate the error message. If S3 is slow/unreachable (a likely scenario when locks are stuck), this adds latency to the error path. Not a correctness issue since readLock catches errors and returns null, but it delays surfacing the timeout to the caller. The old code used the already-fetched existing variable. This is a minor regression in error-path performance.
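  The clamp suggested in the first finding could be factored as a small helper. This is a sketch only: globalPollTimeoutMs is a hypothetical name, the bounds are illustrative, and the 30_000 default mirrors the existing fallback.

```typescript
// Sketch of the suggested clamp: never let a remote lock's ttlMs produce an
// unbounded or zero timeout. globalPollTimeoutMs is a hypothetical helper
// name; the [2s, 120s] bounds are illustrative.
function globalPollTimeoutMs(ttlMs: number | undefined): number {
  // Non-finite values (undefined, NaN, Infinity) fall back to the default.
  const base = Number.isFinite(ttlMs ?? NaN) ? (ttlMs as number) : 30_000;
  // Clamp so corrupted values (0, Number.MAX_SAFE_INTEGER) cannot disable
  // or unbound the timeout.
  return Math.min(Math.max(base, 1_000) * 2, 120_000);
}
```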

Low

  1. src/infrastructure/persistence/s3_lock_test.ts:267,310 — headObject override uses the mock as unknown as Record<string, unknown> pattern. This bypasses type checking on the mock, so if headObject's signature changes, these tests won't catch the mismatch at compile time. Not a bug today, but fragile. Existing pattern in the codebase though, so not a regression.

Verdict

PASS — The core fix is correct and well-tested. The timeout check relocation properly closes the infinite-loop bug on the stale-lock continue path. The global polling timeouts are a good hardening measure. The two medium findings are edge cases that don't block.

@stack72 stack72 merged commit 17ae7ae into main Mar 26, 2026
19 of 20 checks passed
@stack72 stack72 deleted the fix/s3-stale-lock-timeout-872 branch March 26, 2026 12:02


Development

Successfully merging this pull request may close these issues.

Swamp S3 Datastore: Stale Locks Cause Indefinite Hangs
