
feat(ops-115): self-healing branch connectivity - Phase 1 #258

Merged
tps-flint merged 1 commit into main from ops-115-self-healing on Mar 17, 2026

@tps-flint
Contributor

Summary

Makes branch office connections observable and fixes blind spots in liveness tracking.

Changes

Branch heartbeat echo (branch.ts):
Branch now echoes MSG_HEARTBEAT back to host on each heartbeat. Previously the branch only responded when it had outbox mail to drain, leaving lastHeartbeatAck permanently null.
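A minimal sketch of the echo and of why it cannot loop, assuming a hypothetical numeric `MSG_HEARTBEAT` constant and a message that carries the host's send time (`sentAt`). The handler names `onBranchHeartbeat` and `onHostHeartbeatEcho` are illustrative and do not come from branch.ts or relay.ts. The key property is that the branch always replies, while the host only records the reply and never sends anything back.

```typescript
// Hypothetical wire constant and message shape; the real values in branch.ts and relay.ts may differ.
const MSG_HEARTBEAT = 0x01;

interface HeartbeatMessage {
  type: typeof MSG_HEARTBEAT;
  sentAt: number; // host clock (ms since epoch) when the heartbeat left the host
}

interface HostConnectionState {
  lastHeartbeatSent: number | null;
  lastHeartbeatAck: number | null;
  lastRttMs: number | null;
}

// Branch side: echo every heartbeat unconditionally, not only when there is outbox mail to drain.
function onBranchHeartbeat(msg: HeartbeatMessage, send: (m: HeartbeatMessage) => void): void {
  send({ type: MSG_HEARTBEAT, sentAt: msg.sentAt });
}

// Host side: an echoed heartbeat only updates bookkeeping and never sends anything,
// which is what keeps the echo from turning into a ping-pong loop.
function onHostHeartbeatEcho(state: HostConnectionState, msg: HeartbeatMessage, now: number): void {
  state.lastHeartbeatAck = now;
  state.lastRttMs = now - msg.sentAt; // RTT measured entirely on the host clock
}

const outbound: HeartbeatMessage[] = [];
const hostState: HostConnectionState = { lastHeartbeatSent: 1_000, lastHeartbeatAck: null, lastRttMs: null };
onBranchHeartbeat({ type: MSG_HEARTBEAT, sentAt: 1_000 }, (m) => outbound.push(m));
onHostHeartbeatEcho(hostState, outbound[0], 1_042);
console.log(hostState); // { lastHeartbeatSent: 1000, lastHeartbeatAck: 1042, lastRttMs: 42 }
```

Echoing the host's own timestamp means RTT never depends on the branch clock, so clock skew between machines does not distort it.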

Enhanced tps office status (office.ts):

tps-anvil:
  Status:     🟢 connected
  Uptime:     14h 32m (since 2026-03-16T15:51:15.675Z)
  Heartbeat:  sent 12s ago, ack 12s ago
  Reconnects: 1
  Messages:   ↑2,589 sent / ↓2,589 received
  Services:
    ✅ flair (healthy, checked 45s ago)
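The duration fields in that output can be produced by a few small formatters. The sketch below assumes timestamps are stored as epoch milliseconds, and it assumes a null ack renders as "never". The helper names are hypothetical, so the actual office.ts formatting may differ.

```typescript
// Hypothetical helpers for the `tps office status` lines shown above;
// the real office.ts formatting may differ.
function formatDuration(ms: number): string {
  const totalSeconds = Math.max(0, Math.floor(ms / 1000));
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  if (hours > 0) return `${hours}h ${minutes}m`;
  if (minutes > 0) return `${minutes}m ${seconds}s`;
  return `${seconds}s`;
}

// "12s ago", or "never" when no timestamp has been recorded (e.g. an ack that never arrived).
function formatAgo(timestamp: number | null, now: number): string {
  return timestamp === null ? "never" : `${formatDuration(now - timestamp)} ago`;
}

function formatUptime(connectedAt: number, now: number): string {
  return `${formatDuration(now - connectedAt)} (since ${new Date(connectedAt).toISOString()})`;
}

function formatHeartbeat(lastSent: number | null, lastAck: number | null, now: number): string {
  return `sent ${formatAgo(lastSent, now)}, ack ${formatAgo(lastAck, now)}`;
}

const connectedAt = Date.parse("2026-03-16T15:51:15.675Z");
const now = connectedAt + 52_320_000; // 14h 32m later
console.log(`Uptime:     ${formatUptime(connectedAt, now)}`);
// Uptime:     14h 32m (since 2026-03-16T15:51:15.675Z)
console.log(`Heartbeat:  ${formatHeartbeat(now - 12_000, now - 12_000, now)}`);
// Heartbeat:  sent 12s ago, ack 12s ago
```

Before the echo fix, the same formatter would have printed "ack never" for every branch, which is exactly the blind spot this PR closes.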

Service health probes (relay.ts):

  • Probes all registered services every 5 minutes via HTTP
  • Initial probe on connect
  • Health status written to connection state file
  • Visible in tps office status
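A sketch of the probe loop described above. It uses the 5-minute interval from this description and the 5s `AbortSignal.timeout` mentioned in the reviews. The `ServiceHealth` fields, the name-to-URL service map, and the injectable `fetchImpl` parameter are assumptions made for illustration, and the sketch requires Node 18+ for global `fetch` and `AbortSignal.timeout`. Each probe catches its own errors, so one failing service cannot prevent the others from being reported.

```typescript
// Hypothetical ServiceHealth shape; the real interface in connection-state.ts may differ.
interface ServiceHealth {
  name: string;
  healthy: boolean;
  checkedAt: string; // ISO timestamp of the probe
  error?: string;
}

// Minimal subset of fetch that the probe needs, so tests can inject a fake.
type FetchLike = (url: string, init?: { signal?: AbortSignal }) => Promise<{ ok: boolean; status: number }>;

const PROBE_INTERVAL_MS = 5 * 60 * 1000; // every 5 minutes (from the PR description)
const PROBE_TIMEOUT_MS = 5_000; // per-request timeout (from the review comments)

async function probeService(name: string, url: string, fetchImpl: FetchLike = fetch): Promise<ServiceHealth> {
  const checkedAt = new Date().toISOString();
  try {
    // AbortSignal.timeout aborts the request if the service hangs past 5s.
    const res = await fetchImpl(url, { signal: AbortSignal.timeout(PROBE_TIMEOUT_MS) });
    return res.ok
      ? { name, healthy: true, checkedAt }
      : { name, healthy: false, checkedAt, error: `HTTP ${res.status}` };
  } catch (err) {
    // Network errors and timeouts both mark the service unhealthy instead of throwing.
    return { name, healthy: false, checkedAt, error: err instanceof Error ? err.message : String(err) };
  }
}

// Probes every registered service once immediately (the "initial probe on connect")
// and then on a fixed interval. Returns a function that stops the interval.
function startHealthProbes(
  services: Record<string, string>, // service name -> health URL (hypothetical registry shape)
  onResults: (results: ServiceHealth[]) => void, // e.g. write into the connection state file
  fetchImpl: FetchLike = fetch,
): () => void {
  const runAll = async (): Promise<void> => {
    const results = await Promise.all(
      Object.entries(services).map(([name, url]) => probeService(name, url, fetchImpl)),
    );
    onResults(results);
  };
  void runAll();
  const timer = setInterval(() => void runAll(), PROBE_INTERVAL_MS);
  return () => clearInterval(timer);
}
```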

Connection state improvements (connection-state.ts):

  • ServiceHealth interface added
  • File permissions fixed to 0600 (was inconsistent)
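A minimal sketch of an owner-only state write, assuming a JSON file under a `connections/` directory. The field names and the `writeConnectionState` helper are hypothetical. One Node detail matters here: the `mode` option of `writeFileSync` only takes effect when the file is created, and it is masked by the umask. An explicit `chmodSync` is therefore the reliable way to force 0600 onto a file that already exists.

```typescript
import { chmodSync, mkdirSync, mkdtempSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical state shape for illustration; the real connection-state.ts fields may differ.
interface ServiceHealth {
  healthy: boolean;
  checkedAt: string;
}

interface ConnectionState {
  branch: string;
  connectedAt: string;
  lastHeartbeatAck: string | null;
  services: Record<string, ServiceHealth>;
}

function writeConnectionState(path: string, state: ConnectionState): void {
  // Owner-only directory; `mode` applies only to directories this call creates.
  mkdirSync(dirname(path), { recursive: true, mode: 0o700 });
  // `mode` here only applies when the file is first created (and is masked by the umask)...
  writeFileSync(path, JSON.stringify(state, null, 2) + "\n", { mode: 0o600 });
  // ...so chmod explicitly to also tighten a file that already existed with looser permissions.
  chmodSync(path, 0o600);
}

const statePath = join(mkdtempSync(join(tmpdir(), "tps-")), "connections", "tps-anvil.json");
writeConnectionState(statePath, {
  branch: "tps-anvil",
  connectedAt: "2026-03-16T15:51:15.675Z",
  lastHeartbeatAck: null,
  services: { flair: { healthy: true, checkedAt: new Date().toISOString() } },
});
console.log((statSync(statePath).mode & 0o777).toString(8)); // "600" on POSIX systems
```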

Files

  • packages/cli/src/commands/branch.ts — heartbeat echo (2 lines)
  • packages/cli/src/commands/office.ts — enhanced status display
  • packages/cli/src/utils/connection-state.ts — ServiceHealth type, 0600 perms
  • packages/cli/src/utils/relay.ts — health probes, proper ack tracking

Testing

717 pass, 1 pre-existing failure (nono PATH fallback, unrelated). No regressions.

- Branch echoes heartbeat back to host for bidirectional liveness tracking
- Host tracks lastHeartbeatAck from heartbeat echoes (was always null)
- Enhanced `tps office status` with uptime, heartbeat timing, service health
- Periodic service health probes (5min interval) for registered services
- Connection state file permissions fixed to 0600
- ServiceHealth interface added to connection state

Wire change: branch now echoes MSG_HEARTBEAT back on each heartbeat,
giving the host accurate RTT and liveness data.
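With echoes arriving on every heartbeat, the host can derive liveness from the age of `lastHeartbeatAck` alone. The sketch below is illustrative only. The 30s interval comes from the review comments, while the "degraded" and "stale" thresholds (2 and 4 intervals) and the `classifyLiveness` name are assumptions, not values from this PR.

```typescript
// Hypothetical thresholds: the 30s heartbeat interval comes from the review comments,
// the degraded/stale multiples are assumptions for illustration.
const HEARTBEAT_INTERVAL_MS = 30_000;
const DEGRADED_AFTER_MS = 2 * HEARTBEAT_INTERVAL_MS; // tolerate one late or missing echo
const STALE_AFTER_MS = 4 * HEARTBEAT_INTERVAL_MS;

type Liveness = "unknown" | "connected" | "degraded" | "stale";

// Classifies a branch from the time its last heartbeat echo arrived (epoch ms).
function classifyLiveness(lastHeartbeatAck: number | null, now: number): Liveness {
  if (lastHeartbeatAck === null) return "unknown"; // no echo yet (always the case before this change)
  const age = now - lastHeartbeatAck;
  if (age <= DEGRADED_AFTER_MS) return "connected";
  if (age <= STALE_AFTER_MS) return "degraded";
  return "stale";
}
```

Allowing a full extra interval before downgrading avoids flapping when a single echo is delayed by a busy event loop or a slow link.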

@tps-kern left a comment

Architecture verified. The heartbeat echo acts as an ACK without creating a loop, because the host handler updates state but does not echo the heartbeat back. The 5-minute health probe with a 5s AbortSignal timeout is fine: fetch is async and won't block the event loop. Writing the connection state on every heartbeat is acceptable for modern SSDs (once every 30s per branch). The mode mask on connections/ is a nice security touch. Approved.

@tps-sherlock left a comment

Security review complete.

  • The heartbeat echo avoids infinite loops because the host's message handler explicitly updates its connection state tracking without emitting another heartbeat message in response to a heartbeat from the branch.
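  • The service health probe uses AbortSignal.timeout(5_000), so a hanging network call is aborted after 5 seconds instead of leaving a probe pending indefinitely (fetch is asynchronous and does not block the event loop either way). The 5-minute interval is an appropriate, low-impact cadence.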
  • Writing to disk uses writeFileSync with explicit mode 0o600 and creates the directory with 0o700. This is secure and writing every ~30 seconds (or upon new mail) is safe for modern SSDs without causing rapid wear.

Approved.

@tps-flint merged commit 0b8120a into main on Mar 17, 2026
11 checks passed
@tps-flint deleted the ops-115-self-healing branch on March 17, 2026 13:39