fix: resolve pause-resume race condition (#3081)#3489
fix: resolve pause-resume race condition (#3081)#3489MaxwellCalkin wants to merge 1 commit intosimstudioai:mainfrom
Conversation
…mstudioai#3081) When a resume request arrives before persistPauseResult finishes committing the paused execution record, enqueueOrStartResume fails with "Paused execution not found". This is critical for high-throughput automation where external systems resume milliseconds after pause. Two complementary fixes: 1. Wrap persistPauseResult's DB insert + queue processing in a single transaction so the pause record and any pending resume claims are atomically visible. The SELECT FOR UPDATE in enqueueOrStartResume will block on the row lock until this transaction commits. 2. Add exponential backoff retry (5 attempts, 50ms base delay) in enqueueOrStartResume for the specific "not found" error, handling the window where the row doesn't exist yet and SELECT FOR UPDATE can't acquire a lock on a non-existent row. Only the race-condition-specific error is retried; other errors (already resumed, snapshot not ready, etc.) fail immediately. Closes simstudioai#3081 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR SummaryMedium Risk Overview Adds exponential-backoff retry to Written by Cursor Bugbot for commit 1e4b3e0. This will update automatically on new commits. Configure here. |
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| pausePoints: sql`jsonb_set(pause_points, ARRAY[${pendingEntry.contextId}, 'resumeStatus'], '"resuming"'::jsonb)`, | ||
| }) | ||
| .where(eq(pausedExecutions.executionId, executionId)) | ||
| } |
There was a problem hiding this comment.
Claimed resume entry never started, permanently orphaned
High Severity
Inside the persistPauseResult transaction, a pending resume queue entry is claimed (status set to 'claimed') and the pause point is marked as 'resuming', but no code ever starts the actual resume execution for that entry. After the transaction, processQueuedResumes only queries for entries with status 'pending', so the newly claimed entry is invisible to it. This orphans the resume request permanently in 'claimed' state — the user's resume action silently disappears.
Additional Locations (1)
Greptile SummaryThis PR fixes a pause-resume race condition (#3081) in
Key observations:
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Exec as Workflow Executor
participant PRM as PauseResumeManager
participant DB as Database (PostgreSQL)
participant API as Resume API
Exec->>PRM: persistPauseResult(...)
PRM->>DB: BEGIN TRANSACTION
PRM->>DB: INSERT pausedExecutions (row locked until commit)
PRM->>DB: SELECT pending resumeQueue entries
alt pending entry found (queued before commit)
PRM->>DB: UPDATE resumeQueue → 'claimed'
PRM->>DB: UPDATE pausedExecutions.pausePoints → 'resuming'
end
PRM->>DB: COMMIT (row now visible)
PRM->>PRM: processQueuedResumes(executionId)
note over API,DB: Race: resume arrives before COMMIT
API->>PRM: enqueueOrStartResume(...)
loop Retry (up to 5 attempts, 50ms×2^n backoff)
PRM->>DB: BEGIN TRANSACTION
PRM->>DB: SELECT FOR UPDATE pausedExecutions WHERE executionId=?
alt row not yet visible (pre-commit race window)
DB-->>PRM: empty result
PRM-->>PRM: throw "Paused execution not found"
PRM->>PRM: wait delay, retry
else row visible and locked
DB-->>PRM: row (blocks if persistPauseResult tx still open)
PRM->>DB: INSERT resumeQueue → 'claimed'
PRM->>DB: UPDATE pausedExecutions.pausePoints → 'resuming'
PRM->>DB: COMMIT
PRM-->>API: { status: 'starting', resumeExecutionId }
end
end
Last reviewed commit: 1e4b3e0 |
| const isPausedNotFound = | ||
| error instanceof Error && | ||
| error.message === 'Paused execution not found or already resumed' |
There was a problem hiding this comment.
Fragile string-based error type detection
The retry guard uses exact string comparison against the error message to distinguish race-condition errors from other errors. This couples the retry logic tightly to the exact wording of the error thrown in tryEnqueueOrStartResume. If either string is ever updated independently (e.g., for clarity), the retry will silently stop working without any compile-time or test-time signal.
A more robust approach would be to use a typed error class or an error code:
// Define once, used in both throw and catch sites
export class PausedExecutionNotFoundError extends Error {
readonly code = 'PAUSED_EXECUTION_NOT_FOUND'
constructor() {
super('Paused execution not found or already resumed')
}
}Then the guard becomes:
const isPausedNotFound = error instanceof PausedExecutionNotFoundErrorAnd the throw site in tryEnqueueOrStartResume:
throw new PausedExecutionNotFoundError()This makes the relationship between the throw site and the catch site explicit and refactor-safe.
| /** | ||
| * Base delay in milliseconds between retry attempts. Each retry doubles | ||
| * the delay (exponential backoff): 50ms, 100ms, 200ms, 400ms, 800ms. | ||
| */ | ||
| private static readonly RESUME_RETRY_BASE_DELAY_MS = 50 |
There was a problem hiding this comment.
JSDoc comment lists a delay that never occurs
The comment says the delays are 50ms, 100ms, 200ms, 400ms, 800ms (five values), but only four delays actually happen. On the final attempt (attempt = 4), the condition attempt >= RESUME_RETRY_MAX_ATTEMPTS - 1 is immediately true, so the error is re-thrown without waiting — the 800ms delay is never applied.
| /** | |
| * Base delay in milliseconds between retry attempts. Each retry doubles | |
| * the delay (exponential backoff): 50ms, 100ms, 200ms, 400ms, 800ms. | |
| */ | |
| private static readonly RESUME_RETRY_BASE_DELAY_MS = 50 | |
| /** | |
| * Base delay in milliseconds between retry attempts. Each retry doubles | |
| * the delay (exponential backoff): 50ms, 100ms, 200ms, 400ms. | |
| * The final (5th) attempt throws immediately on failure with no additional delay. | |
| */ | |
| private static readonly RESUME_RETRY_BASE_DELAY_MS = 50 |
| const emptyTerminal = () => ({ | ||
| limit: vi.fn().mockReturnValue({ | ||
| then: vi | ||
| .fn() | ||
| .mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])), | ||
| }), | ||
| then: vi | ||
| .fn() | ||
| .mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])), | ||
| }) |
There was a problem hiding this comment.
Unused helper function emptyTerminal
emptyTerminal is declared but never called anywhere in the mock or tests. It appears to be a leftover from an earlier draft of the mock helper. This will likely trigger a linter warning (@typescript-eslint/no-unused-vars) and adds noise.
| const emptyTerminal = () => ({ | |
| limit: vi.fn().mockReturnValue({ | |
| then: vi | |
| .fn() | |
| .mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])), | |
| }), | |
| then: vi | |
| .fn() | |
| .mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])), | |
| }) | |
| // selectCallCount tracks which select() invocation we're in to differentiate | |
| // pausedExecution lookups (with .for('update')) from activeResume lookups |
(Remove the emptyTerminal declaration entirely and replace with the comment above if clarification is needed.)


Summary
Fixes the race condition where a resume request arriving before
persistPauseResultcommits the paused execution record causesenqueueOrStartResumeto fail with "Paused execution not found" (issue #3081).Two complementary fixes in
apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts:Transaction wrapping in
persistPauseResult: The DB insert and queue processing are now wrapped in a single transaction, so the pause record and any pending resume claims become atomically visible. TheSELECT FOR UPDATEinenqueueOrStartResumewill block on the row lock until this transaction commits, eliminating the race window.Exponential backoff retry in
enqueueOrStartResume: Adds retry logic (5 attempts, 50ms base delay doubling each attempt) specifically for the "Paused execution not found" error. This handles the edge case where the row doesn't yet exist andSELECT FOR UPDATEcan't acquire a lock on a non-existent row. Other errors (already resumed, snapshot not ready, etc.) fail immediately without retrying.Files changed
apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts— core fixapps/sim/lib/workflows/executor/human-in-the-loop-manager.test.ts— 4 tests for retry behaviorTest plan
🤖 Generated with Claude Code