Rewrite TestChildProcessCleanup with socket-based deterministic liveness probe by maxisbey · Pull Request #2265 · modelcontextprotocol/python-sdk

maxisbey · 2026-03-10T14:30:43Z

Summary

Closes #1775. Supersedes #2071.

The three TestChildProcessCleanup tests in tests/client/test_stdio.py failed intermittently on Windows/macOS CI. The root cause was a fundamentally timing-dependent test design: watching a file grow on disk as a proxy for "is the child process alive?", with fixed anyio.sleep() durations to "wait long enough" for nested Python interpreters to start. On loaded CI, "long enough" wasn't.

A secondary bug compounded this: when the flaky assertion failed, proc was never terminated — the finally blocks only cleaned up tempfiles. The leaked subprocess was GC'd during a later test, triggering PytestUnraisableExceptionWarning there (observed: test_call_tool — an in-memory test that never touches subprocesses — failing with ResourceWarning: subprocess 5592 is still running).

New design — socket-based, zero sleeps, zero polling

Each subprocess in the tree connects a TCP socket back to a listener owned by the test. Two kernel-guaranteed blocking-I/O signals then tell us everything:

await listener.accept() blocks until the subprocess connects → proves it started. No polling for "did the file grow yet?".
await stream.receive(1) after _terminate_process_tree() raises anyio.EndOfStream because the kernel closes all file descriptors — including sockets — when a process terminates. This is a direct, OS-level proof that the child is dead. No polling for "did the file stop growing?".

Every synchronization point is a blocking read that unblocks on a kernel event. The anyio.fail_after() wrappers are upper bounds for catastrophic failure (subprocess never started / survived termination), not timing assumptions.

# Old: fixed sleep, indirect proxy, races against CI load
await anyio.sleep(0.5)
initial_size = os.path.getsize(marker_file)
await anyio.sleep(0.3)
assert os.path.getsize(marker_file) > initial_size  # fails: assert 0 > 0

# New: blocking accept, kernel-level EOF signal
with anyio.fail_after(10):
    stream = await sock.accept()           # blocks until child connects
    assert await stream.receive(5) == b"alive"

await _terminate_process_tree(proc)

with anyio.fail_after(5):
    await stream.receive(1)                # raises EndOfStream: child is dead

Secondary fix: process cleanup in finally

Each test now terminates proc in finally if the test body's own termination didn't run (guarded by proc = None after successful in-test termination to avoid noisy double-terminate). anyio.move_on_after(5) ensures cleanup can't mask the original failure.

Also removed

tempfile import and all file I/O
escape_path_for_python import (no longer need path escaping for embedded scripts — scripts are now injected via {code!r} repr() literals, which eliminates all nested-quote escaping)
~30 lines of boilerplate textwrap.dedent blocks with triply-nested f-string scripts
The parent_marker check from test 1, which was a no-op: NamedTemporaryFile(delete=False) creates the file, so os.path.exists() was always True

Results

Metric	Before	After
3 tests, single run	7.08s	0.46s (15× faster)
30 runs × 4 parallel workers (flake-finder)	—	2.08s, 30/30 pass
Full `./scripts/test`	—	100% coverage
Diff	—	+168 −248 (net −80 lines)
`anyio.sleep()` calls in TestChildProcessCleanup	12	0

_{AI Disclaimer}

…c in finally The three TestChildProcessCleanup tests were flaky on Windows/macOS CI because they used fixed anyio.sleep() durations (0.5s + 0.3s = 800ms) to wait for nested Python interpreter chains to start writing to marker files. On loaded CI runners, two Python startups + subprocess.Popen + open() + first write() can exceed 800ms, causing assert 0 > 0. Secondary bug: when the assertion failed, proc was never terminated (finally only cleaned up tempfiles). The leaked subprocess was GC'd during a later test, triggering PytestUnraisableExceptionWarning and causing knock-on failures in unrelated tests. Changes: - Added _wait_for_file_size() helper: polls until getsize(path) exceeds a threshold, bounded by anyio.fail_after(10). Replaces all startup sleep+assert chains. Raises TimeoutError with a clear message instead of a confusing assert 0 > 0. - Added proc cleanup to each finally block. Uses proc=None tracking after successful in-test termination to avoid redundant double-calls, and anyio.move_on_after() so cleanup timeout never masks the real failure. - Removed the parent_marker check from test_basic_child_process_cleanup. NamedTemporaryFile(delete=False) already creates the file, so the os.path.exists() assertion was a no-op that never verified anything. Child writing proves parent started. - Simplified shutdown verification: terminate_posix_process_tree already polls for process-group death, so the first post-termination sleep(0.5) was redundant. Reduced to a single 0.3s stability check (3x child write interval). Result: 3 tests drop from ~7s to ~1.9s locally, 30/30 pass under 4-worker parallel load (flake-finder). Closes #1775 Github-Issue: #1775

…ess probe Replaces the file-growth-polling approach with a fundamentally different design: each subprocess in the tree connects a TCP socket back to a listener owned by the test. Liveness and death are then detected via blocking I/O operations that unblock on kernel-level events — zero sleeps, zero polling: 1. await listener.accept() blocks until the subprocess connects (proves it started). 2. After _terminate_process_tree(), await stream.receive(1) raises EndOfStream because the kernel closes all file descriptors of a terminated process. This is direct, OS-guaranteed proof that the child is dead. Compared to the original file-watching approach: - No tempfiles, no file I/O, no path escaping - No fixed sleep() durations (the root cause of flakiness) - No polling loops - Failure modes are clear: accept times out = child never started; receive times out = child survived termination (actual bug) Also fixes the process leak on assertion failure by adding proc cleanup to each finally block (guarded by proc=None after successful in-test termination to avoid noisy double-terminate). Result: 3 tests drop from ~7.1s to ~0.5s, 30/30 pass under 4-worker parallel load. Removes 80 net lines and the tempfile/path-escape deps. Closes #1775 Github-Issue: #1775

On Windows, TerminateJobObject causes an abrupt TCP RST (not clean FIN), so anyio raises BrokenResourceError instead of EndOfStream when reading from a socket held by the terminated child. Both exceptions indicate the peer is dead — catch both. Github-Issue: #1775

maxisbey added 2 commits March 10, 2026 14:30

maxisbey changed the title ~~Fix flaky TestChildProcessCleanup: poll for file growth, clean up proc in finally~~ Rewrite TestChildProcessCleanup with socket-based deterministic liveness probe Mar 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite TestChildProcessCleanup with socket-based deterministic liveness probe#2265

Rewrite TestChildProcessCleanup with socket-based deterministic liveness probe#2265
maxisbey wants to merge 3 commits intomainfrom
fix/flaky-child-process-cleanup-polling

maxisbey commented Mar 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

maxisbey commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

New design — socket-based, zero sleeps, zero polling

Secondary fix: process cleanup in finally

Also removed

Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

maxisbey commented Mar 10, 2026 •

edited

Loading