diff --git a/.claude/commands/dev-checklists.md b/.claude/commands/dev-checklists.md index 342118a..a4dc87e 100644 --- a/.claude/commands/dev-checklists.md +++ b/.claude/commands/dev-checklists.md @@ -46,9 +46,13 @@ When implementing or modifying code that affects statistical methodology (estima - [ ] For edge cases: either match reference OR document deviation 3. **When deviating from reference implementations**: - - [ ] Add a **Note** in the Methodology Registry explaining the deviation + - [ ] Add entry in `docs/methodology/REGISTRY.md` using a reviewer-recognized label: + `**Note:**`, `**Deviation from R:**`, or `**Note (deviation from R):**` + (see CLAUDE.md "Documenting Deviations" for full format reference) - [ ] Include rationale (e.g., "defensive enhancement", "R errors here") - [ ] Ensure the deviation is an improvement, not a bug + - [ ] If deferring P2/P3 work: add row to `TODO.md` table under "Tech Debt from Code + Reviews" with columns `Issue | Location | PR | Priority` 4. **Testing methodology-aligned behavior**: - [ ] Test that edge cases produce documented behavior (NaN, warning, etc.) diff --git a/.claude/commands/pre-merge-check.md b/.claude/commands/pre-merge-check.md index fca6e07..0454d83 100644 --- a/.claude/commands/pre-merge-check.md +++ b/.claude/commands/pre-merge-check.md @@ -107,6 +107,20 @@ git diff HEAD -- | grep "^+.*def " | head -10 For each changed function, flag: "Verify docstring Parameters section matches updated signature for: ``" +#### 2.5 Methodology Documentation Check + +If any methodology files changed, check whether `docs/methodology/REGISTRY.md` was also +modified in the changed file set (from Section 1). + +If methodology files changed but REGISTRY.md was NOT modified, flag: +"Methodology files changed but `docs/methodology/REGISTRY.md` was not updated. 
If your +changes deviate from reference implementations, document them using a reviewer-recognized +label (`**Note:**`, `**Deviation from R:**`, or `**Note (deviation from R):**`) — +undocumented deviations are flagged as P1 by the AI reviewer and cannot be mitigated +by TODO.md." + +This is a WARNING, not a blocker — not every methodology change involves a deviation. + ### 3. Display Context-Specific Checklist Based on what changed, display the appropriate checklist items: @@ -134,6 +148,11 @@ Based on your changes to: - [ ] Control group composition verified for new code paths - [ ] "Not-yet-treated" excludes the treatment cohort itself - [ ] Parameter interactions tested with all aggregation methods + +### Methodology Deviation Documentation +- [ ] If deviating from reference implementation: added a reviewer-recognized label + (`**Note:**`, `**Deviation from R:**`, or `**Note (deviation from R):**`) in REGISTRY.md +- [ ] No undocumented methodology deviations (AI reviewer flags these as P1) ``` #### If Documentation Files Changed @@ -141,7 +160,6 @@ Based on your changes to: ### Documentation Sync - [ ] Docstrings updated for changed function signatures - [ ] README updated if user-facing behavior changes -- [ ] REGISTRY.md updated if methodology edge cases change ``` #### If This Appears to Be a Bug Fix diff --git a/.github/codex/prompts/pr_review.md b/.github/codex/prompts/pr_review.md index 3254afe..4cc9ea4 100644 --- a/.github/codex/prompts/pr_review.md +++ b/.github/codex/prompts/pr_review.md @@ -5,7 +5,14 @@ TOP PRIORITY: Methodology adherence to source material. - If the PR changes an estimator, math, weighting, variance/SE, identification assumptions, or default behaviors: 1) Identify which method(s) are affected. 2) Cross-check against the cited paper(s) and the Methodology Registry. - 3) Flag any mismatch, missing assumption check, incorrect variance/SE, or undocumented deviation as P0/P1. 
+ 3) Flag any UNDOCUMENTED mismatch, missing assumption check, or incorrect variance/SE as P0/P1. + 4) If a deviation IS documented in REGISTRY.md (look for "**Note:**", "**Deviation from R:**", + "**Note (deviation from R):**" labels), it is NOT a defect. Classify as P3-informational + (P3 = minor/informational, no action required). + 5) Different valid numerical approaches to the same mathematical operation (e.g., Cholesky vs QR, + SVD vs eigendecomposition, multiplier vs nonparametric bootstrap) are implementation choices, + not methodology errors — unless the approach is provably wrong (produces incorrect results), + not merely different. SECONDARY PRIORITIES (in order): 2) Edge case coverage (see checklist below) @@ -47,10 +54,23 @@ When reviewing new features or code paths, specifically check: - Command to check: `grep -n "pattern" diff_diff/*.py` - Flag as P1 if only partial fixes were made +## Deferred Work Acceptance + +This project tracks deferred technical debt in `TODO.md` under "Tech Debt from Code Reviews." + +- If a limitation is already tracked in `TODO.md` with a PR reference, it is NOT a blocker. +- If a PR ADDS a new `TODO.md` entry for deferred work, that counts as properly tracking + deferrable items (test gaps, documentation, performance). Classify those as + P3-informational ("tracked in TODO.md"), not P1/P2. +- Only flag deferred work as P1+ if it introduces a SILENT correctness bug (wrong numbers + with no warning/error) that is NOT tracked anywhere. +- Test gaps, documentation gaps, and performance improvements are deferrable. Missing NaN guards + and incorrect statistical output are not. + Rules: - Review ONLY the changes introduced by this PR (diff) and the minimum surrounding context needed. 
- Provide a single Markdown report with: - - Overall assessment: ✅ Looks good | ⚠️ Needs changes | ⛔ Blocker + - Overall assessment (see Assessment Criteria below) - Executive summary (3–6 bullets) - Sections for: Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests - In each section: list findings with Severity (P0/P1/P2/P3), Impact, and Concrete fix. @@ -59,6 +79,47 @@ Rules: Output must be a single Markdown message. +## Assessment Criteria + +Apply the assessment based on the HIGHEST severity of UNMITIGATED findings: + +⛔ Blocker — One or more P0: silent correctness bugs (wrong statistical output with no + warning), data corruption, or security vulnerabilities. + +⚠️ Needs changes — One or more P1 (no P0s): missing edge-case handling that could produce + errors in production, undocumented methodology deviations, or anti-pattern violations. + +✅ Looks good — No unmitigated P0 or P1 findings. P2/P3 items may exist. A PR does NOT need + to be perfect to receive ✅. Tracked limitations, documented deviations, and minor gaps + are compatible with ✅. + +A finding is MITIGATED (does not count toward assessment) if: +- The deviation is documented in `docs/methodology/REGISTRY.md` with a Note/Deviation label +- The limitation is tracked in `TODO.md` under "Tech Debt from Code Reviews" +- The PR itself adds a TODO.md entry or REGISTRY.md note for the issue +- The finding is about an implementation choice between valid numerical approaches + +A finding is NEVER mitigated by TODO.md tracking if it is: +- A P0: silent correctness bug, NaN/inference inconsistency, data corruption, or security issue +- A P1: missing assumption check, incorrect variance/SE, or undocumented methodology deviation +Only P2/P3 findings (code quality, test gaps, documentation, performance) can be downgraded +by tracking in TODO.md. 
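As an aside for maintainers (not part of the prompt file itself): the assessment and mitigation rules above can be sketched as a small decision function. This is an illustrative sketch only; the function and field names are hypothetical, and P0/P1 findings are assumed to arrive with `mitigated=False` since tracking never downgrades them.

```python
def assess(findings):
    """Map review findings to an overall assessment.

    findings: list of dicts with keys 'severity' (one of 'P0'..'P3')
    and 'mitigated' (bool). Per the rules above, P0/P1 findings can
    never be mitigated by TODO.md tracking, so their 'mitigated' flag
    is expected to be False.
    """
    # Only unmitigated findings count toward the assessment.
    unmitigated = [f for f in findings if not f["mitigated"]]
    if any(f["severity"] == "P0" for f in unmitigated):
        return "⛔ Blocker"
    if any(f["severity"] == "P1" for f in unmitigated):
        return "⚠️ Needs changes"
    # P2/P3 items (tracked or not) are compatible with approval.
    return "✅ Looks good"
```

A PR with only tracked P2/P3 debt would come out as `✅ Looks good` under this logic, matching the "a PR does NOT need to be perfect" rule.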
+
+When the assessment is ⚠️ or ⛔, include a "Path to Approval" section listing specific,
+enumerated changes that would move the assessment to ✅. Each item must be concrete and
+actionable (not "improve testing" but "add test for X with input Y").
+
+## Re-review Scope
+
+When this is a re-review (the PR has prior AI review comments):
+- Focus primarily on whether PREVIOUS findings have been addressed.
+- New P1+ findings on unchanged code MAY be raised but must be marked "[Newly identified]"
+  to distinguish them from moving the goalposts. Limit these to clear, concrete issues — not
+  speculative concerns or stylistic preferences.
+- New code added since the last review IS in scope for new findings.
+- If all previous P1+ findings are resolved, the assessment should be ✅ even if new
+  P2/P3 items are noticed.
+
 ## Known Anti-Patterns
 
 Flag these patterns in new or modified code:
diff --git a/.github/workflows/ai_pr_review.yml b/.github/workflows/ai_pr_review.yml
index 483f5b8..cbdf587 100644
--- a/.github/workflows/ai_pr_review.yml
+++ b/.github/workflows/ai_pr_review.yml
@@ -89,12 +89,33 @@ jobs:
           "${{ steps.pr.outputs.base_ref }}" \
           +refs/pull/${{ steps.pr.outputs.number }}/head
 
+      - name: Fetch previous AI review (if any)
+        id: prev_review
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const { owner, repo } = context.repo;
+            const issue_number = Number('${{ steps.pr.outputs.number }}');
+            const comments = await github.paginate(github.rest.issues.listComments, {
+              owner, repo, issue_number, per_page: 100,
+            });
+            const aiComments = comments.filter(c =>
+              (c.body || "").includes("
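To ground point 5 of the review priorities (different valid numerical approaches to the same mathematical operation are implementation choices, not methodology errors), here is a minimal standalone sketch, assuming `numpy` is available. Nothing below is part of the prompt or workflow files above; it only demonstrates that a Cholesky-based normal-equations solve and a QR-based solve of the same least-squares problem agree.

```python
import numpy as np

# A small synthetic least-squares problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

# Route 1: normal equations via Cholesky. Factor X'X = L L',
# then solve L z = X'y followed by L' beta = z.
XtX = X.T @ X
Xty = X.T @ y
L = np.linalg.cholesky(XtX)
beta_chol = np.linalg.solve(L.T, np.linalg.solve(L, Xty))

# Route 2: reduced QR factorization. With X = Q R, solve R beta = Q'y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)
```

A reviewer comparing two such implementations should expect agreement within floating-point tolerance (`np.allclose`), not bit-for-bit equality; only a genuine divergence in the computed estimates would indicate a methodology error.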