· ✦ · ✦ ·
✦ · ⚡ · ✦
░░▒▓████▓▒░░
▒▓█▀ ▀█▓▒
▓█ ◆ ◆ █▓
██ ╲ ╱ ██
▓█ ═══⚒═══ █▓
▒▓█▄ ▄█▓▒
░░▒▓████▓▒░░
▓██▓
╔═══╧══╧═══╗
║ THE FORGE ║
╚══════════╝
▄▄████████████▄▄
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
A task loop with KPI guardrails for Claude Code and Codex/manual workflows.
Forge is a protocol plus adapters. It takes open-text software tasks, keeps coverage/speed/quality as guardrails, records state across iterations, and runs until the work is honestly done or you stop it.
```text
You:   /forge "password reset flow" --done-when "users can request and complete a reset end-to-end" --coverage 90 --speed -30%
Forge: Measuring baseline... 85.2% coverage, 120s
       Success contract: password reset works end-to-end
       Strategy: coverage-push → 15 tests for edge cases
       85.8% (+0.6%), 118s (-2s) ✓
...iterates until task success and KPI targets are both satisfied...
```
Forge is for cases where plain prompting is too loose but a full agent framework is too heavy.
- Give it a task
- Optionally say what "done" means
- Keep tests, coverage, speed, and quality in view
- Iterate with recorded state instead of re-explaining yourself every round
The portable part of the system:
- iteration protocol (Orient → Measure → Evaluate → Decide → Execute → Verify → Record → Complete)
- task-driven success contract with an optional explicit `done_when`
- state format and autoregressive memory
- KPI targets (coverage, speed, quality)
- strategy selection and stagnation handling
- lessons and ideas backlog
The bundled runtime adapter in this repo:
- `/forge` command
- `/forge-cancel` command
- `/forge-status` command
- `agents/forge.md`
- `hooks/stop-hook.sh`
- install script that wires those assets into `~/.claude/`
The bundled Codex/manual adapter in this repo:
- `install-codex.sh`
- `drivers/codex/bin/forge-init`
- `drivers/codex/bin/forge-continue`
- `drivers/codex/bin/forge-cancel`
- `drivers/codex/bin/forge-status`
- `.codex/forge/` state layout for per-project sessions
- shared shell state helpers reused across drivers
Both drivers are first-class. The difference is automation depth: Claude gets hook-driven iteration; Codex gets manual driver scripts that print the next prompt and manage session state.
| Environment | Status | What is actually shipped |
|---|---|---|
| Claude Code | First-class | Command, agent, stop-hook driver, installer |
| Codex CLI | First-class manual driver | Install script, forge-init, forge-continue, forge-cancel, project-local state |
| Other agents / plain shell | Protocol-only | Reuse the protocol and state model manually |
Forge is not claiming native parity across agent runtimes. It ships two real drivers with different control surfaces.
Forge is not pretending to emerge from nowhere.
- Ralph Wiggum — Geoff Huntley gave the core loop shape: fresh context, file-backed iteration, and the willingness to let simple loops do real work.
- autoresearch — Andrej Karpathy reinforced the deletion bias, binary keep/discard discipline, and the value of tiny, explicit skills.
- pi-autoresearch — Tobi Lutke and David Cortes pushed the pattern toward measurable software work beyond ML and made the backlog / measurement story sharper.
- SICA — Self-Improving Coding Agent showed that compounding improvement works better when strategy selection learns from prior evidence.
- autoresearch-mlx — trevin-creator showed the loop itself can be a target of improvement, not just the code under test.
Forge’s job is not to erase those influences. It is to package them into a cleaner, more practical tool surface.
Each iteration executes one complete eight-phase cycle:
| Phase | What happens |
|---|---|
| A. Orient | Read forge-state file, check task success contract + KPI trends + stagnation count |
| B. Measure | Run tests with coverage, capture KPIs |
| C. Evaluate | Every 3rd iteration: spawn fresh-context subagent for unbiased audit |
| D. Decide | Pick strategy from KPI gaps + findings + lessons |
| E. Execute | Apply ONE focused transformation |
| F. Verify | Tests must be green, re-measure KPIs |
| G. Record | Update forge-state with deltas + lessons (the autoregressive step) |
| H. Complete | Task success contract satisfied and KPI targets met? Done. Otherwise, next iteration. |
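As a sketch only, the cycle above maps onto a driver-agnostic shell loop. The `phase_*` function names are hypothetical placeholders, not the shipped hook; a real driver implements each phase against its runtime.

```shell
# Illustrative skeleton of one Forge iteration: run the eight phases in
# order, aborting the iteration if any phase fails.
run_cycle() {
  for phase in orient measure evaluate decide execute verify record complete; do
    "phase_$phase" || return 1   # a failed phase stops this iteration
  done
}

# Stub phases so the skeleton runs standalone; real drivers replace these.
for p in orient measure evaluate decide execute verify record complete; do
  eval "phase_$p() { echo \"$p\"; }"
done
```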
Forge is built for open-text work, not just KPI chasing.
- The task scope is the primary objective.
- `--done-when "TEXT"` is an optional explicit success override.
- If `--done-when` is omitted, Forge derives concrete completion checks from the task scope and records them in Forge state.
- Coverage, speed, and quality stay as guardrails alongside the task itself.
- Completion means both the task and the guardrails are satisfied.
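As a minimal sketch of that completion rule (the helper name and argument shape are hypothetical; real thresholds live in Forge state):

```shell
# Hypothetical sketch: Forge is done only when the task's completion
# checks AND every KPI guardrail are satisfied.
forge_done() {
  task_ok="$1"                     # "true" once all completion checks pass
  coverage="$2" min_coverage="$3"  # coverage guardrail
  speed_s="$4" max_speed_s="$5"    # speed guardrail
  [ "$task_ok" = "true" ] || return 1
  awk "BEGIN { exit !($coverage >= $min_coverage && $speed_s <= $max_speed_s) }"
}
```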
Forge selects from named strategies based on which KPI gap is largest:
| Strategy | When | Impact |
|---|---|---|
| `coverage-push` | Clear coverage gaps | Coverage |
| `refactor-for-testability` | Code hard to test | Coverage |
| `component-extraction` | DRY violations, repeated patterns | Coverage + Quality |
| `speed-optimization` | Slow tests, sync overuse | Speed |
| `dead-code-removal` | Unused code flagged by evaluation | Quality + Coverage |
| `quality-polish` | Naming, complexity, clarity | Quality |
| `design-system` | Duplicated UI patterns | Quality + Coverage |
| `simplification` | Code that can be made simpler | Quality |
When coverage improves by less than 0.1% in an iteration, Forge increments a stagnation counter. Once the counter reaches 3, Forge automatically rotates to a different strategy — the historically most effective one, or an untried one. No manual intervention needed.
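One simplification of that rule as a sketch (the function name is hypothetical; deltas are coverage percentage points):

```shell
# Hypothetical sketch: a sub-0.1% coverage delta bumps the stagnation
# counter; any larger improvement resets it. The caller rotates strategy
# when the printed counter reaches 3.
next_stagnation_count() {
  delta="$1" count="$2"
  if awk "BEGIN { exit !($delta < 0.1) }"; then
    echo $((count + 1))
  else
    echo 0
  fi
}
```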
Every 3rd iteration, Forge runs a fresh-context audit pass. In Claude Code this is typically a subagent; in other environments it may be an isolated reviewer or manual second pass. The protocol requires fresh context, not a specific vendor primitive.
```shell
git clone https://github.com/DjinnFoundry/forge-loop.git
cd forge-loop
./install.sh
```

The installer symlinks the Claude Code driver assets into your `~/.claude/` directory.
Important: You also need to configure the stop hook that drives iteration. See hooks/README.md for setup instructions.
```shell
mkdir -p ~/.claude/skills/forge ~/.claude/commands ~/.claude/agents ~/.claude/hooks
cp skills/forge/SKILL.md ~/.claude/skills/forge/SKILL.md
cp commands/forge.md ~/.claude/commands/forge.md
cp commands/forge-cancel.md ~/.claude/commands/forge-cancel.md
cp commands/cancel-ralph.md ~/.claude/commands/cancel-ralph.md
cp commands/forge-status.md ~/.claude/commands/forge-status.md
cp agents/forge.md ~/.claude/agents/forge.md
cp hooks/stop-hook.sh ~/.claude/hooks/stop-hook.sh
# Stop hook — see hooks/README.md for settings.json setup
```

```shell
git clone https://github.com/DjinnFoundry/forge-loop.git
cd forge-loop
./install-codex.sh
```

The Codex installer links Forge Core into `~/.codex/skills/forge/` and installs driver entrypoints into `~/.codex/bin/`.
Codex support is manual by design, but it is now a real shipped driver.
Typical flow:
1. Run `forge-init "scope" [--done-when "TEXT"] ...` in the target project.
2. Paste the printed prompt into Codex.
3. After each iteration, run `forge-continue` to print the next prompt.
4. Use `forge-status` to inspect the active session.
5. Use `forge-cancel` to stop the active loop while preserving Forge state.
This is a first-class manual driver, not a hook-based runtime integration.
Driver safety:
- `forge-continue` derives the next iteration from recorded Forge state entries
- multiple active Codex sessions require an explicit session id instead of implicit selection
- `forge-status` is read-only and reports the next required iteration from Forge state
```shell
/forge "LiveView components" --coverage 95 --speed -20%
/forge "password reset flow" --done-when "users can request, receive, and complete a reset end-to-end" --coverage 90 --quality strict
```

```shell
/forge "SCOPE" [--done-when "TEXT"] --coverage N --speed -N% --quality strict|moderate|lax --max-iterations N
```
| Option | Default | Description |
|---|---|---|
| `SCOPE` | (required) | What to improve — quoted string |
| `--done-when "TEXT"` | task-derived | Explicit success contract. If omitted, derive completion checks from the task itself |
| `--coverage N` | baseline + 2 | Minimum coverage % target |
| `--speed -N%` | -20% | Speed reduction from baseline |
| `--quality` | moderate | strict (0 high, 0 med) / moderate (0 high, ≤3 med) / lax (0 high, ≤5 med) |
| `--max-iterations` | 20 | Safety limit |
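The `--quality` thresholds can be sketched as a simple gate. This is a hypothetical helper, not the shipped implementation; it takes the quality level plus counts of high- and medium-severity findings:

```shell
# Hypothetical quality gate: any high-severity finding blocks at every
# level; the medium-severity budget depends on the level.
quality_ok() {
  level="$1" high="$2" medium="$3"
  [ "$high" -eq 0 ] || return 1
  case "$level" in
    strict)   [ "$medium" -eq 0 ] ;;   # 0 high, 0 med
    moderate) [ "$medium" -le 3 ] ;;   # 0 high, <=3 med
    lax)      [ "$medium" -le 5 ] ;;   # 0 high, <=5 med
    *)        return 1 ;;
  esac
}
```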
- Pause: Forge outputs `FORGE_PAUSE` when it needs your input
- Cancel: `/forge-cancel` stops the loop
- Status: `/forge-status` reports the current Claude driver session state
- Inspect state: `.claude/forge-state.SESSION.md` is preserved when you pause or cancel
Use the same protocol phases and state format, but drive the loop yourself. Today that means:
- no bundled driver beyond Claude Code and Codex
- no automatic hook/runtime integration outside Claude Code
- no runtime-specific install story beyond the shipped drivers
Forge persists its state in driver-specific roots:
- Claude Code: `.claude/forge-state.SESSION.md`
- Codex: `.codex/forge/forge-state.SESSION.md`
Claude’s loop driver uses `.claude/forge-loop.SESSION.local.md` as the primary loop-state file name. Legacy `.claude/ralph-loop.SESSION.local.md` files are still accepted for compatibility.
Other runtimes can reuse the same format in a different state root. Each iteration appends its KPIs, strategy, actions, and lessons. This is the autoregressive memory.
```md
---
session_id: "0320-1430-a3b2"
scope: "API controllers"
success:
  mode: "task-derived"
  task: "API controllers"
  done_when: null
  completion_checks:
    - "controller edge cases covered and passing"
    - "no controller path regresses current behavior"
baseline:
  coverage: 85.2
  speed_seconds: 120
  tests: 1250
  failures: 0
  measured_at: "2026-03-20T14:30:00Z"
targets:
  min_coverage: 90.0
  max_speed_seconds: 84
  quality: "moderate"
  max_iterations: 20
current_strategy: "component-extraction"
stagnation_count: 0
strategies_tried:
  - name: "coverage-push"
    iterations: [1, 2]
    coverage_delta: 0.8
    speed_delta: -5
lessons:
  - "async:true on controller tests saves ~3s per file"
ideas:
  - "auth module has dead code paths worth investigating"
---

## Iteration 1 — coverage-push
- Coverage: 85.2 → 85.8 (+0.6%)
- Speed: 120s → 118s (-2s)
- Tests: 1250 → 1265 (+15)
- Actions: Added 15 tests for data_loaders edge cases
- Reality-check: 2 high, 3 medium findings
- Lesson: "7 identical try-rescue blocks — extract, don't test each"
```

```text
forge-loop/
├── skills/forge/SKILL.md        ← The protocol (source of truth)
├── commands/forge.md            ← Claude Code /forge command
├── commands/forge-cancel.md     ← Primary Claude stop command
├── commands/cancel-ralph.md     ← Legacy alias for compatibility
├── commands/forge-status.md     ← Shows Claude driver session status
├── drivers/codex/               ← Codex/manual driver scripts + prompt template
│   ├── bin/
│   │   ├── forge-init
│   │   ├── forge-continue
│   │   ├── forge-cancel
│   │   └── forge-status
│   ├── lib.sh
│   ├── prompt.md
│   └── README.md
├── agents/forge.md              ← Subagent for spawning forge on subsystems
├── hooks/                       ← Iteration engine
│   ├── README.md                ← Hook setup instructions
│   └── stop-hook.sh             ← Stop hook script
├── install.sh                   ← Installer script
├── install-codex.sh             ← Codex driver installer
├── scripts/forge-state-lib.sh   ← Shared shell state helpers
├── tests/
│   ├── stop-hook.test.sh
│   └── codex-driver.test.sh
├── CHANGELOG.md
├── CONTRIBUTING.md
└── README.md
```
The runtime layout is intentionally asymmetric: the protocol is portable, while drivers map that protocol to their runtime's real affordances. The Claude driver uses a stop hook and loop-state files. The Codex driver uses explicit shell entrypoints and project-local state files. Both preserve the same Forge Core semantics.
Distilled from Ralph, autoresearch, pi-autoresearch, SICA, and a dozen related loops:
- Loops are simple. The magic is in the loop. The universal pattern is: Modify, Measure, Compare, Keep/Discard, Record, Repeat. Everything else is details.
- Simpler is better. Code deletion at same KPIs is always a win. Don't add complexity for marginal gains.
- Autonomy scales when you constrain scope, clarify success, and mechanize verification. Tests aren't just QA — they're the rails the loop runs on.
- Binary keep/discard. Improved? Keep. Didn't? Revert. No gray area, no partial credit.
- State survives context. The forge-state file is the autoregressive memory. It survives context compaction, agent restarts, and session swaps.
- Fresh eyes beat anchored ones. Subagents with no iteration context prevent "the numbers look fine" bias.
- Think harder, don't stop. When stuck: re-read code, review backlog, combine near-misses, try the inverse, try simplification. Never pause to ask.
- Each improvement should make future improvements easier. (Addy Osmani)
| Aspect | Raw loop | Forge |
|---|---|---|
| KPI tracking | Ad-hoc | Structured state file with deltas + trends |
| Strategy | Single prompt | 8 named strategies, auto-rotation on stagnation |
| Evaluation | Self-evaluation (anchoring bias) | Fresh-context audits every 3 iterations |
| Memory | Context window only | Persistent state file survives compaction |
| Completion | Manual / hope | Exact completion marker after task success plus protocol checks |
| Lessons | Lost between iterations | Accumulated, inform strategy selection |
| Stagnation | Repeats same approach | Detects + rotates after low-delta iterations |
| Portability | Rebuild per runtime | Portable protocol, Claude and Codex drivers bundled |
- Forge packages proven loop patterns into a reusable protocol with first-class Claude Code and Codex/manual drivers.
- Forge improves repeatability versus ad-hoc prompting when you care about task success, KPI guardrails, iteration memory, and strategy rotation.
- Forge does not yet provide universal runtime adapter parity beyond the shipped drivers.
- Forge is more preconfigured than raw hooks. It is not a new primitive.
- Claude Code CLI
- `jq` (for the stop hook)
- A project with a test suite that reports coverage

- Codex CLI
- `jq`
- A project with a test suite that reports coverage
- `~/.codex/bin` on your `PATH` if you want driver commands globally available

- Any agent/runtime that can follow the Forge protocol manually
- Some place to persist Forge state between iterations
- A project with a measurable test/quality loop
The skill includes test runner examples for multiple languages (Elixir, Python, JavaScript, Ruby, Go). To adapt:
- Edit `skills/forge/SKILL.md` — update the MEASURE phase for your test runner
- Update the coverage/speed parsing for your output format
- Everything else (strategies, stagnation, state format) is language-agnostic
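For example, a MEASURE-phase coverage parser for Python might look like this sketch. It assumes a coverage.py-style report with a `TOTAL ... NN%` summary line; other runners need their own pattern:

```shell
# Hypothetical MEASURE-phase helper: read test-runner output on stdin
# and print the TOTAL coverage percentage from a coverage.py-style
# summary line such as: "TOTAL    1250    180    86%".
parse_coverage() {
  awk '/^TOTAL/ { gsub(/%/, "", $NF); print $NF }'
}
```

A hypothetical invocation would be `pytest --cov 2>&1 | parse_coverage`.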
See CONTRIBUTING.md.