From a62fc6e7aaf087795053fb69a3181cc4d2580279 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 14 Mar 2026 15:04:33 -0400 Subject: [PATCH 1/2] Fix overly strict AI PR review prompt causing excessive review cycles Address five failure modes observed after gpt-5.4 upgrade (PRs #192, #194, #195): documented deviations flagged as blockers, deferred work not accepted, moving goalposts on re-review, undefined approval criteria, and valid implementation choices treated as methodology errors. Changes to pr_review.md: - Exempt REGISTRY.md-documented deviations from P0/P1 (classify as P3) - Add implementation choice exception for valid numerical approaches - Add Deferred Work Acceptance section honoring TODO.md tracking - Add Assessment Criteria with objective verdicts and mitigation rules - Add Re-review Scope rules to prevent oscillation between rounds Changes to ai_pr_review.yml: - Add step to fetch previous AI review comment for re-review context - Inject prior review findings into compiled prompt on /ai-review reruns Co-Authored-By: Claude Opus 4.6 (1M context) --- .github/codex/prompts/pr_review.md | 58 ++++++++++++++++++++++++++++-- .github/workflows/ai_pr_review.yml | 29 +++++++++++++++ 2 files changed, 85 insertions(+), 2 deletions(-) diff --git a/.github/codex/prompts/pr_review.md b/.github/codex/prompts/pr_review.md index 3254afe..b5caab5 100644 --- a/.github/codex/prompts/pr_review.md +++ b/.github/codex/prompts/pr_review.md @@ -5,7 +5,14 @@ TOP PRIORITY: Methodology adherence to source material. - If the PR changes an estimator, math, weighting, variance/SE, identification assumptions, or default behaviors: 1) Identify which method(s) are affected. 2) Cross-check against the cited paper(s) and the Methodology Registry. - 3) Flag any mismatch, missing assumption check, incorrect variance/SE, or undocumented deviation as P0/P1. + 3) Flag any UNDOCUMENTED mismatch, missing assumption check, or incorrect variance/SE as P0/P1. 
+ 4) If a deviation IS documented in REGISTRY.md (look for "**Note:**", "**Deviation from R:**", + "**Note (deviation from R):**" labels), it is NOT a defect. Classify as P3-informational + (P3 = minor/informational, no action required). + 5) Different valid numerical approaches to the same mathematical operation (e.g., Cholesky vs QR, + SVD vs eigendecomposition, multiplier vs nonparametric bootstrap) are implementation choices, + not methodology errors — unless the approach is provably wrong (produces incorrect results), + not merely different. SECONDARY PRIORITIES (in order): 2) Edge case coverage (see checklist below) @@ -47,10 +54,22 @@ When reviewing new features or code paths, specifically check: - Command to check: `grep -n "pattern" diff_diff/*.py` - Flag as P1 if only partial fixes were made +## Deferred Work Acceptance + +This project tracks deferred technical debt in `TODO.md` under "Tech Debt from Code Reviews." + +- If a limitation is already tracked in `TODO.md` with a PR reference, it is NOT a blocker. +- If a PR ADDS a new `TODO.md` entry for deferred work, that counts as properly tracking it. + Classify as P3-informational ("tracked in TODO.md"), not P1/P2. +- Only flag deferred work as P1+ if it introduces a SILENT correctness bug (wrong numbers + with no warning/error) that is NOT tracked anywhere. +- Test gaps, documentation gaps, and performance improvements are deferrable. Missing NaN guards + and incorrect statistical output are not. + Rules: - Review ONLY the changes introduced by this PR (diff) and the minimum surrounding context needed. - Provide a single Markdown report with: - - Overall assessment: ✅ Looks good | ⚠️ Needs changes | ⛔ Blocker + - Overall assessment (see Assessment Criteria below) - Executive summary (3–6 bullets) - Sections for: Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests - In each section: list findings with Severity (P0/P1/P2/P3), Impact, and Concrete fix. 
@@ -59,6 +78,41 @@ Rules: Output must be a single Markdown message. +## Assessment Criteria + +Apply the assessment based on the HIGHEST severity of UNMITIGATED findings: + +⛔ Blocker — One or more P0: silent correctness bugs (wrong statistical output with no + warning), data corruption, or security vulnerabilities. + +⚠️ Needs changes — One or more P1 (no P0s): missing edge-case handling that could produce + errors in production, undocumented methodology deviations, or anti-pattern violations. + +✅ Looks good — No unmitigated P0 or P1 findings. P2/P3 items may exist. A PR does NOT need + to be perfect to receive ✅. Tracked limitations, documented deviations, and minor gaps + are compatible with ✅. + +A finding is MITIGATED (does not count toward assessment) if: +- The deviation is documented in `docs/methodology/REGISTRY.md` with a Note/Deviation label +- The limitation is tracked in `TODO.md` under "Tech Debt from Code Reviews" +- The PR itself adds a TODO.md entry or REGISTRY.md note for the issue +- The finding is about an implementation choice between valid numerical approaches + +When the assessment is ⚠️ or ⛔, include a "Path to Approval" section listing specific, +enumerated changes that would move the assessment to ✅. Each item must be concrete and +actionable (not "improve testing" but "add test for X with input Y"). + +## Re-review Scope + +When this is a re-review (the PR has prior AI review comments): +- Focus primarily on whether PREVIOUS findings have been addressed. +- New P1+ findings on unchanged code MAY be raised but must be marked "[Newly identified]" + to distinguish from moving goalposts. Limit these to clear, concrete issues — not + speculative concerns or stylistic preferences. +- New code added since the last review IS in scope for new findings. +- If all previous P1+ findings are resolved, the assessment should be ✅ even if new + P2/P3 items are noticed. 
+ ## Known Anti-Patterns Flag these patterns in new or modified code: diff --git a/.github/workflows/ai_pr_review.yml b/.github/workflows/ai_pr_review.yml index 483f5b8..eccab81 100644 --- a/.github/workflows/ai_pr_review.yml +++ b/.github/workflows/ai_pr_review.yml @@ -89,12 +89,32 @@ jobs: "${{ steps.pr.outputs.base_ref }}" \ +refs/pull/${{ steps.pr.outputs.number }}/head + - name: Fetch previous AI review (if any) + id: prev_review + uses: actions/github-script@v7 + with: + script: | + const { owner, repo } = context.repo; + const issue_number = Number('${{ steps.pr.outputs.number }}'); + const comments = await github.paginate(github.rest.issues.listComments, { + owner, repo, issue_number, per_page: 100, + }); + const aiComments = comments.filter(c => + (c.body || "").includes("
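The patch is cut off mid-string above, but the "fetch previous AI review" step it introduces follows a common pattern: paginate the PR's issue comments, keep only those carrying the AI-review marker, and take the most recent match as the prior review. A minimal standalone sketch of that filtering logic follows; the `MARKER` string and the `findPreviousReview` name are illustrative assumptions, since the real marker is truncated out of the patch.

```javascript
// Sketch of the previous-review lookup used by the workflow step.
// MARKER is hypothetical — the actual marker string is cut off in the
// patch above; any stable string embedded in the bot's comments works.
const MARKER = "<!-- ai-pr-review -->";

function findPreviousReview(comments) {
  // Keep only comments that carry the AI-review marker. Guard against
  // comments with a missing body.
  const aiComments = comments.filter(c => (c.body || "").includes(MARKER));
  // The Issues API returns comments oldest-first by default, so the
  // last match is the latest review round.
  return aiComments.length > 0 ? aiComments[aiComments.length - 1] : null;
}

// Example: two marked review comments; the later round should win.
const comments = [
  { id: 1, body: "LGTM" },
  { id: 2, body: MARKER + "\nRound 1 findings" },
  { id: 3, body: MARKER + "\nRound 2 findings" },
];
console.log(findPreviousReview(comments).id); // → 3
```

In the workflow itself this runs inside `actions/github-script`, with `comments` coming from `github.paginate(github.rest.issues.listComments, …)` as shown in the diff; the selected comment's body would then be injected into the compiled prompt so the Re-review Scope rules have the prior findings to check against.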