fix: prevent indefinite hangs from stale S3 datastore locks#874

Merged
stack72 merged 1 commit into main from fix/s3-stale-lock-timeout-872
Mar 26, 2026

Conversation


@stack72 stack72 commented Mar 26, 2026

Summary

Fixes #872 — S3 datastore locks cause indefinite hangs when stale lock files exist in S3.

Root cause: The acquire() loop in both S3Lock and FileLock had a control flow bug where the continue after stale lock detection bypassed the maxWaitMs timeout check at the bottom of the loop. If the stale lock couldn't be cleaned up (e.g. due to S3 versioning interactions where DeleteObject creates a delete marker but conditional writes still fail, or persistent delete failures), the loop became infinite — the 60-second timeout never fired because the code never reached it. Additionally, two global lock polling loops in acquireModelLocks() had no timeout at all.

What changed

  • S3Lock.acquire() / FileLock.acquire() — Moved the maxWaitMs timeout check from the bottom of the loop to the top, so it fires on every iteration including after continue from stale lock cleanup. When deleteObject succeeds, the immediate retry behavior is preserved (no unnecessary sleep).

  • acquireModelLocks() in repo_context.ts — Added a timeout of 2× the lock's TTL (default: 60s) to both global lock polling loops. These previously had while(true) with no exit condition other than the lock disappearing or staleness detection succeeding.

  • S3Lock.release() / FileLock.release() — Added warn-level logging when lock deletion fails during release. Previously all errors were silently swallowed, making it invisible why locks were accumulating.

  • New tests — Two new test cases in s3_lock_test.ts:

    • Stale lock where deleteObject always throws → verifies LockTimeoutError fires
    • Stale lock where delete "succeeds" but lock reappears (S3 versioning scenario) → verifies timeout fires instead of infinite loop
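The fixed control flow above can be sketched as follows. This is a hypothetical skeleton: tryPutLock, readLock, deleteLock, and sleep are illustrative stand-ins for the real S3Lock/FileLock operations, which are not shown in this PR.

```typescript
// Sketch of the fixed acquire() control flow. The operation names below
// (tryPutLock, readLock, deleteLock, sleep) are illustrative stand-ins,
// not the actual S3Lock/FileLock API.
class LockTimeoutError extends Error {}

interface LockOps {
  tryPutLock(): Promise<boolean>; // conditional write; false if the lock exists
  readLock(): Promise<{ stale: boolean } | null>;
  deleteLock(): Promise<void>; // may throw (e.g. S3 versioning interactions)
  sleep(ms: number): Promise<void>;
}

async function acquire(
  ops: LockOps,
  maxWaitMs = 60_000,
  retryIntervalMs = 500,
): Promise<void> {
  const start = Date.now();
  while (true) {
    // Timeout check at the TOP of the loop: it now runs on every iteration,
    // including right after the `continue` from stale-lock cleanup below.
    if (Date.now() - start >= maxWaitMs) {
      throw new LockTimeoutError(`lock not acquired within ${maxWaitMs}ms`);
    }
    if (await ops.tryPutLock()) return; // acquired
    const existing = await ops.readLock();
    if (existing?.stale) {
      try {
        await ops.deleteLock();
      } catch {
        // Cleanup failed; the top-of-loop check still bounds the loop.
      }
      continue; // immediate retry, no unnecessary sleep after cleanup
    }
    await ops.sleep(retryIntervalMs);
  }
}
```

In the old shape, the timeout check sat below the sleep at the bottom, so the `continue` in the stale branch jumped over it entirely.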

Trade-offs

  • Moving the timeout check from the bottom of the loop to the top means elapsed is ~0ms on the very first iteration, so there's one extra (negligible) Date.now() call per acquisition. No functional impact.
  • The global lock polling timeout (2× TTL = 60s) is a balance: long enough to accommodate legitimate structural commands, short enough to recover from stale locks. If a structural command legitimately holds the global lock for >60s, per-model operations will now throw LockTimeoutError instead of waiting. This seems preferable to hanging indefinitely.
  • Lock release logging adds a logger dependency to S3Lock and FileLock. These are infrastructure modules that already depend on the domain layer, so the new import to the logging module is consistent with the existing dependency direction.
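A minimal sketch of the bounded global-lock polling described above, assuming hypothetical readGlobalLock and sleep helpers (the real acquireModelLocks() internals are not shown in this PR):

```typescript
// Sketch of the bounded global-lock polling loop; readGlobalLock and sleep
// are hypothetical stand-ins for the real acquireModelLocks() internals.
class LockTimeoutError extends Error {}

async function waitForGlobalLockRelease(
  readGlobalLock: () => Promise<{ ttlMs: number } | null>,
  sleep: (ms: number) => Promise<void>,
  pollIntervalMs = 250,
): Promise<void> {
  const held = await readGlobalLock();
  if (held === null) return; // no global lock held
  const maxWaitMs = (held.ttlMs ?? 30_000) * 2; // 2x TTL, 60s by default
  const start = Date.now();
  while (true) {
    if (Date.now() - start >= maxWaitMs) {
      // Previously: while (true) with no exit condition other than the lock
      // disappearing, so a stale undeletable lock meant an infinite hang.
      throw new LockTimeoutError(`global lock still held after ${maxWaitMs}ms`);
    }
    if ((await readGlobalLock()) === null) return; // lock released
    await sleep(pollIntervalMs);
  }
}
```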

User impact

| Before | After |
| --- | --- |
| Stale S3 lock → indefinite hang, CI job stuck forever | Recovers within 60s or throws a clear `LockTimeoutError` with holder info |
| Sequential model methods run in CI → unusable after 3-6 operations | Works reliably |
| Lock release failures → silent, no visibility | Warn-level log message with error details |
| Workaround: manual `aws s3 rm` of lock files | No workaround needed; `swamp datastore lock release --force` also available |

No changes to happy-path behavior. Lock acquisition, heartbeat, and release all work identically when things go right.

Test plan

  • New unit test: S3Lock - times out when stale lock cannot be deleted
  • New unit test: S3Lock - timeout check fires even during stale lock retry loop
  • All existing S3Lock tests pass (13/13)
  • Full test suite passes (3559/3559)
  • deno check, deno lint, deno fmt clean
  • deno run compile succeeds

🤖 Generated with Claude Code

The acquire() loop in both S3Lock and FileLock had a control flow bug
where the `continue` after stale lock detection bypassed the maxWaitMs
timeout check, creating an infinite loop when the stale lock could not
be deleted (e.g. due to S3 versioning interactions or persistent delete
failures). Additionally, the global lock polling loops in
acquireModelLocks() had no timeout at all.

Moves the timeout check to the top of the acquire() loop so it fires
on every iteration. Adds a 2x TTL timeout to the global lock polling
loops in acquireModelLocks(). Adds warn-level logging to release() when
lock deletion fails.

Co-authored-by: Kief Morris <kief@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@github-actions github-actions bot left a comment


Adversarial Review

Critical / High

None found.

Medium

  1. Tight loop without backoff when stale lock cleanup repeatedly fails — src/infrastructure/persistence/s3_lock.ts:128-133 and src/infrastructure/persistence/file_lock.ts:132-137

    When a stale lock is detected and deleteObject/Deno.remove throws, the continue goes back to the top of the loop. The path is: timeout check → putObjectConditional (fails) → readLock/headObject (stale) → deleteObject (throws) → continue. The retryIntervalMs sleep at line 139/142 is never reached on this path.

    With default settings (maxWaitMs: 60_000), this means up to 60 seconds of tight async calls hammering the S3 API (4 calls per iteration) with no backoff. In CI with parallel processes hitting the same bucket, this could generate substantial unnecessary API traffic and cost.

    This is pre-existing behavior — the PR actually improves the situation by guaranteeing the loop terminates. But now that the loop is guaranteed to run for the full maxWaitMs duration in this scenario (instead of hanging forever), the tight polling is more likely to be exercised in practice.

    Example: S3 bucket with versioning enabled, stale lock, deleteObject creates delete markers but conditional puts keep failing. 60 seconds × ~100 iterations/sec = ~24,000 S3 API calls.

    Suggested improvement (not blocking): Add a sleep before the continue in the stale-lock-delete-failure catch block, or fall through to the normal sleep at the bottom of the loop instead of continuing.
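    The suggested change could be factored as a small helper along these lines (cleanupStaleLock and its parameters are hypothetical names; the real catch block lives inside acquire()):

```typescript
// Sketch of the suggested backoff: when deleting a stale lock fails, sleep
// before retrying instead of spinning straight back to the top of the loop.
// cleanupStaleLock and its parameter names are hypothetical.
async function cleanupStaleLock(
  deleteLock: () => Promise<void>,
  sleep: (ms: number) => Promise<void>,
  retryIntervalMs: number,
): Promise<boolean> {
  try {
    await deleteLock();
    return true; // deleted: caller may retry the conditional put immediately
  } catch {
    await sleep(retryIntervalMs); // back off instead of hammering the S3 API
    return false;
  }
}
```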

Low

  1. Recursive acquireModelLocks has no depth limit — src/cli/repo_context.ts:699

    If the global lock flaps (appears right after each per-model lock acquisition, then disappears during polling), the function recurses indefinitely. The polling timeout limits the time per recursion, but not the recursion depth. In practice, stack overflow is extremely unlikely because each recursion involves substantial async work and the scenario requires adversarial lock timing. Not blocking, but worth noting.

  2. globalInfo.ttlMs ?? 30_000 with ttlMs: 0 — src/cli/repo_context.ts:592

    LockInfo.ttlMs is a required field so the ?? fallback is defensive (fine). However, if a lock were ever written with ttlMs: 0, then globalMaxWaitMs = 0 and the timeout would fire on the very first check, throwing LockTimeoutError immediately even if the lock is about to be released. Unlikely in practice since ttlMs: 0 would mean the lock is always stale.

Verdict

PASS — This is a well-targeted fix for a real infinite-hang bug. The timeout-at-top-of-loop approach correctly ensures the timeout check cannot be bypassed by continue jumps from stale lock cleanup. The repo_context.ts polling loops now have bounded wait times. Release logging is correctly differentiated between backends (S3 DELETE is idempotent so no NotFound filtering needed; file remove correctly filters NotFound). New tests exercise both identified failure modes (persistent delete failure, and delete-succeeds-but-lock-reappears). The changes are minimal, focused, and well-tested.


@github-actions github-actions bot left a comment


Code Review

Clean, well-targeted bug fix. The root cause analysis is thorough and the fix is structurally sound.

Blocking Issues

None.

Suggestions

  1. FileLock stale-lock timeout tests: The new S3Lock tests (times out when stale lock cannot be deleted, timeout check fires even during stale lock retry loop) validate the exact bug being fixed. FileLock received the same structural fix (timeout check moved to top of loop) but no equivalent test was added. The existing stale lock is force-acquired test covers the happy path but not the failure-to-delete edge case. Consider adding a parallel test for FileLock (e.g., make the lockfile undeletable via permissions) for completeness — not blocking since the logic is symmetric and the S3 tests validate the pattern.

  2. Logger instantiation in error path: In both S3Lock.release() and FileLock.release(), getSwampLogger() is called inside the catch block on every error. This is fine for an error path (infrequent), but if getSwampLogger is expensive, you could hoist it to a class field. Very minor — not worth changing unless profiling shows an issue.


@github-actions github-actions bot left a comment


Adversarial Review

Critical / High

None found.

Medium

  1. src/cli/repo_context.ts:592 — globalInfo.ttlMs used without validation for timeout calculation. If a corrupted or adversarial lock file contains ttlMs: Infinity or ttlMs: Number.MAX_SAFE_INTEGER, then globalMaxWaitMs becomes Infinity and the timeout never fires, restoring the original hang. In practice this requires a corrupted lock file in S3/filesystem, so the blast radius is small — the lock's own acquire() timeout (which uses a locally-configured maxWaitMs) would still fire when re-acquiring. But the global polling loops in acquireModelLocks trust the remote lock's ttlMs entirely. Consider clamping: Math.min((globalInfo.ttlMs ?? 30_000) * 2, 120_000).

  2. src/infrastructure/persistence/s3_lock.ts:103 — extra S3 call on timeout path. When elapsed >= maxWaitMs, readLock() makes a getObject call just to populate the error message. If S3 is slow/unreachable (a likely scenario when locks are stuck), this adds latency to the error path. Not a correctness issue since readLock catches errors and returns null, but it delays surfacing the timeout to the caller. The old code used the already-fetched existing variable. This is a minor regression in error-path performance.
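  The clamp suggested in the first finding could be factored as a small helper. This is a sketch only: globalPollTimeoutMs is a hypothetical name, the bounds are illustrative, and the 30_000 default mirrors the existing fallback.

```typescript
// Sketch of the suggested clamp: never let a remote lock's ttlMs produce an
// unbounded or zero timeout. globalPollTimeoutMs is a hypothetical helper
// name; the [2s, 120s] bounds are illustrative.
function globalPollTimeoutMs(ttlMs: number | undefined): number {
  // Non-finite values (undefined, NaN, Infinity) fall back to the default.
  const base = Number.isFinite(ttlMs ?? NaN) ? (ttlMs as number) : 30_000;
  // Clamp so corrupted values (0, Number.MAX_SAFE_INTEGER) cannot disable
  // or unbound the timeout.
  return Math.min(Math.max(base, 1_000) * 2, 120_000);
}
```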

Low

  1. src/infrastructure/persistence/s3_lock_test.ts:267,310 — headObject override uses the mock as unknown as Record<string, unknown> pattern. This bypasses type checking on the mock, so if headObject's signature changes, these tests won't catch the mismatch at compile time. Not a bug today, but fragile. Existing pattern in the codebase though, so not a regression.

Verdict

PASS — The core fix is correct and well-tested. The timeout check relocation properly closes the infinite-loop bug on the stale-lock continue path. The global polling timeouts are a good hardening measure. The two medium findings are edge cases that don't block.

@stack72 stack72 merged commit 17ae7ae into main Mar 26, 2026
19 of 20 checks passed
@stack72 stack72 deleted the fix/s3-stale-lock-timeout-872 branch March 26, 2026 12:02


Development

Successfully merging this pull request may close these issues.

Swamp S3 Datastore: Stale Locks Cause Indefinite Hangs
