Skip to content

Rewrite TestChildProcessCleanup with socket-based deterministic liveness probe#2265

Draft
maxisbey wants to merge 3 commits intomainfrom
fix/flaky-child-process-cleanup-polling
Draft

Rewrite TestChildProcessCleanup with socket-based deterministic liveness probe#2265
maxisbey wants to merge 3 commits intomainfrom
fix/flaky-child-process-cleanup-polling

Conversation

@maxisbey
Copy link
Contributor

@maxisbey maxisbey commented Mar 10, 2026

Summary

Closes #1775. Supersedes #2071.

The three TestChildProcessCleanup tests in tests/client/test_stdio.py failed intermittently on Windows/macOS CI. The root cause was a fundamentally timing-dependent test design: watching a file grow on disk as a proxy for "is the child process alive?", with fixed anyio.sleep() durations to "wait long enough" for nested Python interpreters to start. On loaded CI, "long enough" wasn't.

A secondary bug compounded this: when the flaky assertion failed, proc was never terminated — the finally blocks only cleaned up tempfiles. The leaked subprocess was GC'd during a later test, triggering PytestUnraisableExceptionWarning there (observed: test_call_tool — an in-memory test that never touches subprocesses — failing with ResourceWarning: subprocess 5592 is still running).

New design — socket-based, zero sleeps, zero polling

Each subprocess in the tree connects a TCP socket back to a listener owned by the test. Two kernel-guaranteed blocking-I/O signals then tell us everything:

  1. await listener.accept() blocks until the subprocess connects → proves it started. No polling for "did the file grow yet?".
  2. await stream.receive(1) after _terminate_process_tree() raises anyio.EndOfStream because the kernel closes all file descriptors — including sockets — when a process terminates. This is a direct, OS-level proof that the child is dead. No polling for "did the file stop growing?".

Every synchronization point is a blocking read that unblocks on a kernel event. The anyio.fail_after() wrappers are upper bounds for catastrophic failure (subprocess never started / survived termination), not timing assumptions.

# Old: fixed sleep, indirect proxy, races against CI load
await anyio.sleep(0.5)
initial_size = os.path.getsize(marker_file)
await anyio.sleep(0.3)
assert os.path.getsize(marker_file) > initial_size  # fails: assert 0 > 0

# New: blocking accept, kernel-level EOF signal
with anyio.fail_after(10):
    stream = await sock.accept()           # blocks until child connects
    assert await stream.receive(5) == b"alive"

await _terminate_process_tree(proc)

with anyio.fail_after(5):
    await stream.receive(1)                # raises EndOfStream: child is dead

Secondary fix: process cleanup in finally

Each test now terminates proc in finally if the test body's own termination didn't run (guarded by proc = None after successful in-test termination to avoid noisy double-terminate). anyio.move_on_after(5) ensures cleanup can't mask the original failure.

Also removed

  • tempfile import and all file I/O
  • escape_path_for_python import (no longer need path escaping for embedded scripts — scripts are now injected via {code!r} repr() literals, which eliminates all nested-quote escaping)
  • ~30 lines of boilerplate textwrap.dedent blocks with triply-nested f-string scripts
  • The parent_marker check from test 1, which was a no-op: NamedTemporaryFile(delete=False) creates the file, so os.path.exists() was always True

Results

Metric Before After
3 tests, single run 7.08s 0.46s (15× faster)
30 runs × 4 parallel workers (flake-finder) 2.08s, 30/30 pass
Full ./scripts/test 100% coverage
Diff +168 −248 (net −80 lines)
anyio.sleep() calls in TestChildProcessCleanup 12 0

AI Disclaimer

…c in finally

The three TestChildProcessCleanup tests were flaky on Windows/macOS CI
because they used fixed anyio.sleep() durations (0.5s + 0.3s = 800ms)
to wait for nested Python interpreter chains to start writing to marker
files. On loaded CI runners, two Python startups + subprocess.Popen +
open() + first write() can exceed 800ms, causing assert 0 > 0.

Secondary bug: when the assertion failed, proc was never terminated
(finally only cleaned up tempfiles). The leaked subprocess was GC'd
during a later test, triggering PytestUnraisableExceptionWarning and
causing knock-on failures in unrelated tests.

Changes:

- Added _wait_for_file_size() helper: polls until getsize(path) exceeds
  a threshold, bounded by anyio.fail_after(10). Replaces all startup
  sleep+assert chains. Raises TimeoutError with a clear message instead
  of a confusing assert 0 > 0.

- Added proc cleanup to each finally block. Uses proc=None tracking
  after successful in-test termination to avoid redundant double-calls,
  and anyio.move_on_after() so cleanup timeout never masks the real
  failure.

- Removed the parent_marker check from test_basic_child_process_cleanup.
  NamedTemporaryFile(delete=False) already creates the file, so the
  os.path.exists() assertion was a no-op that never verified anything.
  Child writing proves parent started.

- Simplified shutdown verification: terminate_posix_process_tree already
  polls for process-group death, so the first post-termination sleep(0.5)
  was redundant. Reduced to a single 0.3s stability check (3x child
  write interval).

Result: 3 tests drop from ~7s to ~1.9s locally, 30/30 pass under
4-worker parallel load (flake-finder).

Closes #1775

Github-Issue: #1775
…ess probe

Replaces the file-growth-polling approach with a fundamentally different
design: each subprocess in the tree connects a TCP socket back to a listener
owned by the test. Liveness and death are then detected via blocking I/O
operations that unblock on kernel-level events — zero sleeps, zero polling:

  1. await listener.accept() blocks until the subprocess connects
     (proves it started).
  2. After _terminate_process_tree(), await stream.receive(1) raises
     EndOfStream because the kernel closes all file descriptors of a
     terminated process. This is direct, OS-guaranteed proof that the
     child is dead.

Compared to the original file-watching approach:
- No tempfiles, no file I/O, no path escaping
- No fixed sleep() durations (the root cause of flakiness)
- No polling loops
- Failure modes are clear: accept times out = child never started;
  receive times out = child survived termination (actual bug)

Also fixes the process leak on assertion failure by adding proc cleanup
to each finally block (guarded by proc=None after successful in-test
termination to avoid noisy double-terminate).

Result: 3 tests drop from ~7.1s to ~0.5s, 30/30 pass under 4-worker
parallel load. Removes 80 net lines and the tempfile/path-escape deps.

Closes #1775

Github-Issue: #1775
@maxisbey maxisbey changed the title Fix flaky TestChildProcessCleanup: poll for file growth, clean up proc in finally Rewrite TestChildProcessCleanup with socket-based deterministic liveness probe Mar 10, 2026
On Windows, TerminateJobObject causes an abrupt TCP RST (not clean FIN),
so anyio raises BrokenResourceError instead of EndOfStream when reading
from a socket held by the terminated child. Both exceptions indicate the
peer is dead — catch both.

Github-Issue: #1775
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix flaky tests in TestChildProcessCleanup (test_stdio.py)

1 participant