Fix worker crash caused by async promise antipattern and event ordering #1311

Merged
josephjclark merged 12 commits into release/next from fix/worker-crash-async-promise-antipattern
Mar 17, 2026
@stuartc stuartc commented Mar 17, 2026

Short Description

Fix a production worker crash (v1.21.2) where LightningTimeoutError: [run:log] timeout killed the container as an uncaught exception. Two bugs combined: an event ordering issue in engine-multi and a new Promise(async ...) antipattern in ws-worker.

Implementation Details

The crash: During a compilation failure, engine-multi emitted WORKFLOW_ERROR before the final WORKFLOW_LOG. The ws-worker tore down the channel on the error event, so the subsequent log push timed out. The new Promise(async (resolve) => { ... }) pattern in run-log.ts did not forward that rejection to the outer Promise: the outer Promise never settled, and the error became an unhandled rejection that crashed the process.
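As a rough illustration of why the wrapper loses the rejection, here is a minimal sketch. The names `Log`, `sendEvent`, `sendLogsBroken` and `sendLogsFixed` are hypothetical stand-ins and not the actual code in run-log.ts; `sendEvent` is assumed to always reject, the way a channel timeout would.

```typescript
type Log = { message: string };

// Hypothetical stand-in for the worker's event push: always rejects,
// simulating a timeout on a channel that has already been torn down.
const sendEvent = async (_log: Log): Promise<void> => {
  throw new Error('[run:log] timeout');
};

// Antipattern: the executor is async, so a rejected await inside it rejects
// the executor's own (ignored) promise. resolve() is never reached, the outer
// promise stays pending, and the error surfaces as an unhandled rejection.
const sendLogsBroken = (logs: Log[]): Promise<void> =>
  new Promise<void>(async (resolve) => {
    for (const log of logs) {
      await sendEvent(log);
    }
    resolve();
  });

// Fix: a plain async function with a for-loop. A rejected await rejects the
// promise the caller is holding, so the caller can catch it.
const sendLogsFixed = async (logs: Log[]): Promise<void> => {
  for (const log of logs) {
    await sendEvent(log);
  }
};
```

Wrapping a call to `sendLogsFixed` in try/catch lets the caller observe the timeout. The same try/catch around `sendLogsBroken` would catch nothing, because the promise it returns never settles.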

Fixes:

  • run-log.ts — Removed unnecessary Promise wrapper; the function is already async
  • try-with-backoff.ts — Same antipattern; refactored to an inner async function
  • destroy.ts — Same antipattern; replaced with async IIFE
  • execute.ts — Moved compilation log into the worker:error handler so it emits before WORKFLOW_ERROR

Bonus: Excluded packages/engine-multi/tmp/** from the pnpm workspace to stop test fixture directories from being picked up as workspace packages and causing lockfile churn.

QA Notes

  • All fixes have regression tests that fail without the fix (RED-GREEN)
  • Full test suites pass: ws-worker (300 tests), engine-multi (205 tests)
  • The event ordering fix adds COMPILE_START/COMPILE_COMPLETE events to mock-run.ts to match real worker behavior
  • The try-with-backoff.ts refactor changes abort rejection from reject() (no arg) to throw e — the promise still rejects but now includes the error

AI Usage

  • I have used Claude Code
  • I have used another model
  • I have not used AI

stuartc added 6 commits March 17, 2026 15:39
fix(ws-worker): remove async promise antipattern in run-log.ts

The non-batch log path wrapped an async function in `new Promise(async
(resolve) => { ... })`. If sendEvent rejected (e.g. channel timeout),
the outer Promise never settled and the rejection was unhandled. This
was the direct cause of the worker crash: LightningTimeoutError on
run:log became an uncaught exception that killed the container.

The function is already async so the wrapper was unnecessary — replaced
with a plain for-loop that properly propagates rejections.

fix(ws-worker): remove async promise antipattern in try-with-backoff.ts

Same `new Promise(async ...)` antipattern as run-log.ts. If the
isCancelled callback or any other code in the catch block threw, the
error became an unhandled rejection instead of propagating to the
caller.

Replaced with an inner async function that returns its promise
directly. setTimeout retry replaced with awaited promise to keep the
flow async-native.
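
The shape described above might look roughly like the following sketch. It is not the real try-with-backoff.ts: the option names (`attempts`, `delayMs`, `isCancelled` as a plain boolean callback) and the doubling delay are assumptions made for illustration.

```typescript
// Hypothetical option names, chosen for this sketch only.
type BackoffOptions = {
  attempts?: number;
  delayMs?: number;
  isCancelled?: () => boolean;
};

// Awaitable delay, used instead of scheduling the retry from a setTimeout callback.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

const tryWithBackoff = <T>(
  fn: () => Promise<T>,
  opts: BackoffOptions = {}
): Promise<T> => {
  const { attempts = 3, delayMs = 10, isCancelled = () => false } = opts;

  // Inner async function: anything thrown here, including an error thrown by
  // isCancelled itself, rejects the promise handed back to the caller.
  const run = async (attempt: number): Promise<T> => {
    try {
      return await fn();
    } catch (e) {
      if (isCancelled() || attempt >= attempts) {
        throw e; // reject with the original error rather than an empty reject()
      }
      await sleep(delayMs * 2 ** (attempt - 1));
      return run(attempt + 1);
    }
  };

  return run(1);
};
```

Because the retry is awaited rather than scheduled from a timer callback, every failure path stays on the promise chain the caller is awaiting.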

fix(ws-worker): remove async promise antipattern in destroy.ts

Same `new Promise(async (resolve) => { ... })` pattern — if
engine.destroy() or waitForRunsAndClaims rejected, the outer promise
never settled. Replaced with an async IIFE so errors propagate through
Promise.all to the caller.
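
A minimal sketch of the async IIFE approach, under stated assumptions: the `Engine` type, the `closeServer` parameter, and the order of the calls inside the IIFE are hypothetical and do not come from destroy.ts.

```typescript
// Hypothetical shapes for this sketch only.
type Engine = { destroy: () => Promise<void> };

const destroy = async (
  engine: Engine,
  waitForRunsAndClaims: () => Promise<void>,
  closeServer: () => Promise<void>
): Promise<void> => {
  await Promise.all([
    closeServer(),
    // Async IIFE: calling it immediately yields a promise that Promise.all
    // tracks, so a rejection from either await below rejects destroy().
    (async () => {
      await waitForRunsAndClaims();
      await engine.destroy();
    })(),
  ]);
};
```

If `engine.destroy()` rejects, the promise returned by `destroy()` rejects with the same error, so the caller can decide how to handle a failed shutdown.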

fix(engine-multi): emit compilation log before workflow-error

During a compilation failure, engine-multi emitted WORKFLOW_ERROR
(from the worker:error handler) before the "Error occurred during
compilation" WORKFLOW_LOG (from the .catch handler). The ws-worker
tears down the channel on WORKFLOW_ERROR, so the subsequent log push
had nowhere to go — triggering the LightningTimeoutError that crashed
the container.

Moved the compilation log into the worker:error handler so it emits
before the error event. Guarded the .catch handler's log with
!didError to avoid duplication. Added COMPILE_START/COMPLETE events
to mock-run.ts to match real worker behavior.
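
The ordering fix and the `didError` guard can be sketched as follows. This is an illustrative model, not the code in execute.ts: `EngineEvent`, `compile`, and the `emit` callback are hypothetical, and `compile` is assumed to report a failure twice, once through an error callback (standing in for worker:error) and once by rejecting.

```typescript
type EngineEvent = {
  type: 'WORKFLOW_LOG' | 'WORKFLOW_ERROR';
  message: string;
};

// Hypothetical compile step: notifies the error handler, then rejects.
const compile = async (onWorkerError: (e: Error) => void): Promise<void> => {
  const err = new Error('syntax error');
  onWorkerError(err);
  throw err;
};

const execute = async (emit: (evt: EngineEvent) => void): Promise<void> => {
  let didError = false;

  const logCompileError = (): void =>
    emit({ type: 'WORKFLOW_LOG', message: 'Error occurred during compilation' });

  // The log is emitted first, while the consumer's channel is still open;
  // the error event that triggers teardown follows it.
  const onWorkerError = (e: Error): void => {
    didError = true;
    logCompileError();
    emit({ type: 'WORKFLOW_ERROR', message: e.message });
  };

  await compile(onWorkerError).catch(() => {
    // Only log here if the error handler has not already done so.
    if (!didError) {
      logCompileError();
    }
  });
};
```

Collecting the emitted events into an array and inspecting their `type` fields is a simple way to assert the ordering in a regression test.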

chore: exclude engine-multi tmp from pnpm workspace

Test fixture directories under packages/engine-multi/tmp/ were being
picked up as workspace packages by the packages/** glob, pulling in
phantom dependencies (ava@6.x and its full transitive tree) that
caused persistent lockfile churn.
@stuartc stuartc requested a review from josephjclark March 17, 2026 13:52
@github-project-automation github-project-automation bot moved this to New Issues in Core Mar 17, 2026
@josephjclark josephjclark changed the base branch from main to release/next March 17, 2026 14:30
}
resolve();
});
for (const log of logs) {

Yep, this absolutely makes sense 👍

// All we're looking for here is two strings of numbers in mb
t.regex(memory?.message, /\d+mb(.+)\d+mb/i);
// All we're looking for here is a number in mb
t.regex(memory?.message, /\d+mb/i);

I've just pushed this unrelated fix, which will not include a changeset

The test is looking for a number like 1.2mb in the output

But by chance in CI it just output 19mb, which failed the test.

I really don't care if it's a decimal or not. Just a number followed by mb is fine.
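
As a quick check of the relaxed pattern, the sketch below runs it against two sample values. The sample strings are hypothetical and are not the real log messages.

```typescript
// The relaxed pattern from the test: one or more digits followed by "mb".
const memoryPattern = /\d+mb/i;

// Hypothetical sample values: one decimal, one whole number.
const samples: string[] = ['1.2mb', '19mb'];

// test() reports whether the pattern occurs anywhere in the string.
const matches: boolean[] = samples.map((s) => memoryPattern.test(s));
```

Both samples match: the decimal value matches on its trailing digits, and the whole number matches outright.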

@josephjclark josephjclark merged commit 6fd3942 into release/next Mar 17, 2026
6 checks passed
@github-project-automation github-project-automation bot moved this from New Issues to Done in Core Mar 17, 2026
@josephjclark josephjclark deleted the fix/worker-crash-async-promise-antipattern branch March 17, 2026 15:44
josephjclark added a commit that referenced this pull request Mar 17, 2026
* Fix worker crash caused by async promise antipattern and event ordering (#1311)

* fix(ws-worker): remove async promise antipattern in run-log.ts

The non-batch log path wrapped an async function in `new Promise(async
(resolve) => { ... })`. If sendEvent rejected (e.g. channel timeout),
the outer Promise never settled and the rejection was unhandled. This
was the direct cause of the worker crash: LightningTimeoutError on
run:log became an uncaught exception that killed the container.

The function is already async so the wrapper was unnecessary — replaced
with a plain for-loop that properly propagates rejections.

* fix(ws-worker): remove async promise antipattern in try-with-backoff.ts

Same `new Promise(async ...)` antipattern as run-log.ts. If the
isCancelled callback or any other code in the catch block threw, the
error became an unhandled rejection instead of propagating to the
caller.

Replaced with an inner async function that returns its promise
directly. setTimeout retry replaced with awaited promise to keep the
flow async-native.

* fix(ws-worker): remove async promise antipattern in destroy.ts

Same `new Promise(async (resolve) => { ... })` pattern — if
engine.destroy() or waitForRunsAndClaims rejected, the outer promise
never settled. Replaced with an async IIFE so errors propagate through
Promise.all to the caller.

* fix(engine-multi): emit compilation log before workflow-error

During a compilation failure, engine-multi emitted WORKFLOW_ERROR
(from the worker:error handler) before the "Error occurred during
compilation" WORKFLOW_LOG (from the .catch handler). The ws-worker
tears down the channel on WORKFLOW_ERROR, so the subsequent log push
had nowhere to go — triggering the LightningTimeoutError that crashed
the container.

Moved the compilation log into the worker:error handler so it emits
before the error event. Guarded the .catch handler's log with
!didError to avoid duplication. Added COMPILE_START/COMPLETE events
to mock-run.ts to match real worker behavior.

* chore: add changesets for worker crash fixes

* chore: exclude engine-multi tmp from pnpm workspace

Test fixture directories under packages/engine-multi/tmp/ were being
picked up as workspace packages by the packages/** glob, pulling in
phantom dependencies (ava@6.x and its full transitive tree) that
caused persistent lockfile churn.

* remove duplicate log, tidy types

* update changelogs

* tidy test

* revert diff on destroy

* keep destroy change after all and fix test

* relax runtime test

it just went flaky

---------

Co-authored-by: Joe Clark <jclark@openfn.org>

* Compile errors step name (#1313)

* engine: report compile errors with step name, not id

* changeset

* remove debug code

* Lazy state errors (#1312)

* compiler: improve error messages for lazy state

* compiler: report position on lazy state reporting

* update extra errors

* update changeset

* types

* versions

---------

Co-authored-by: Stuart Corbishley <corbish@gmail.com>