
Optimize CallawaySantAnna covariate path (5-19x speedup)#194

Open
igerber wants to merge 5 commits into main from speed-review

Conversation


igerber (Owner) commented Mar 8, 2026

Summary

  • Cache Cholesky factorizations of X^T X across (g,t) pairs sharing the same control design matrix, eliminating redundant OLS solves in the covariate regression path
  • Cache propensity score logistic regression coefficients for IPW/DR methods, avoiding redundant BFGS optimization
  • Add _compute_all_att_gt_covariate_reg() batched method for the reg+covariates path, mirroring the existing _compute_all_att_gt_vectorized() pattern
  • Extend benchmark and validation suites with 10-covariate, DR, and IPW scenarios
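The propensity-coefficient caching in the second bullet can be sketched as below. All names here are illustrative (not the library's API), and a plain Newton/IRLS fit stands in for the BFGS optimizer the PR actually avoids re-running:

```python
import numpy as np

def cached_logit_coefs(key, X, d, cache):
    """Reuse fitted propensity-score coefficients across (g,t) pairs that
    share the same design. Illustrative sketch: a Newton/IRLS logistic fit
    stands in for the library's BFGS optimizer."""
    if key in cache:
        return cache[key]  # skip the refit entirely on a cache hit
    beta = np.zeros(X.shape[1])
    for _ in range(25):  # Newton/IRLS iterations
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                    # IRLS weights
        grad = X.T @ (d - p)                 # score
        H = X.T @ (X * W[:, None])           # observed information
        beta = beta + np.linalg.solve(H, grad)
    cache[key] = beta
    return beta
```

A second call with the same key returns the cached array without touching the optimizer, which is where the IPW/DR speedup comes from.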

Speedups at 5K units:

| Scenario | Before | After | Speedup |
|---|---|---|---|
| 2 covariates, reg | 0.158s | 0.028s | 5.6x |
| 10 covariates, reg | 0.535s | 0.042s | 12.7x |
| 2 covariates, DR | 0.373s | 0.059s | 6.3x |
| 10 cov, reg, 50K units | 4.430s | 0.230s | 19.3x |

Methodology references (required if estimator / math changes)

  • Method name(s): Callaway & Sant'Anna (2021) ATT(g,t) estimation
  • Paper / source link(s): Callaway, B., & Sant'Anna, P. H. (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.
  • Any intentional deviations from the source (and why): None — Cholesky factorization caching is a numerical optimization that produces identical results to lstsq (validated to 1e-10 tolerance). Falls back to lstsq on singular/ill-conditioned matrices.
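A minimal sketch of the caching-with-fallback behavior described above; `solve_ols_cached` and the cache layout are hypothetical names, not taken from the diff:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, LinAlgError

def solve_ols_cached(X, y, cache, key):
    """Solve min ||X b - y|| reusing a cached Cholesky factor of X^T X
    across calls that share the same design (same cache key).

    Falls back to lstsq when X^T X is not positive definite, mirroring
    the singular/ill-conditioned fallback described in the PR."""
    if key not in cache:
        try:
            cache[key] = cho_factor(X.T @ X)
        except LinAlgError:
            cache[key] = None  # sentinel: use lstsq for this design
    factor = cache[key]
    if factor is None:
        return np.linalg.lstsq(X, y, rcond=None)[0]
    return cho_solve(factor, X.T @ y)
```

On a well-conditioned design this agrees with lstsq to floating-point precision, which is the equivalence the 1e-10 validation tolerance checks.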

Validation

  • Tests added/updated: benchmarks/speed_review/bench_callaway.py, benchmarks/speed_review/validate_results.py
  • 96/96 staggered tests pass, 1489/1489 full suite tests pass
  • 5-scenario numerical validation (reg/DR/IPW × 0/2/10 covariates) matches baselines saved before optimization

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…hing

Cache Cholesky factorizations of X^T X across (g,t) pairs that share the
same control design matrix, eliminating redundant OLS solves. Cache
propensity score coefficients for IPW/DR methods. Achieves 5-19x speedup
on covariate regression path depending on k.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot commented Mar 8, 2026

Overall assessment: ⛔ Blocker

Executive summary

  • This PR affects CallawaySantAnna’s covariate-adjusted ATT(g,t) paths, especially estimation_method="dr" and estimation_method="reg".
  • P0: the new DR Cholesky cache key is too coarse for control_group="not_yet_treated", so cached outcome regressions can be reused across different cohort-specific control sets.
  • P1: rank-deficient covariate handling no longer matches the Methodology Registry / public rank_deficient_action contract; the fast path skips QR rank checks and bypasses solve_ols.
  • The new benchmark validator would not catch either issue: it never exercises control_group="not_yet_treated" or rank-deficient covariates, and it barely checks inference outputs.
  • No direct security issues stand out.

Methodology

  • Severity: P0 | Impact: The DR cache breaks the documented not_yet_treated control definition. cho_key is set to (base_period_val, t) for all non-never_treated cases, but under control_group="not_yet_treated" the control design also depends on cohort g because cohort g must be excluded from controls for each ATT(g,t). That means the cached Cholesky factor can be reused across different cohorts that share the same (base, t) under the default base_period="varying", giving the wrong outcome-regression fit for at least the lead / pre-treatment comparisons. This violates the registry’s control-mask rule and the paper’s per-ATT(g,t) group-time setup. docs/methodology/REGISTRY.md:395, diff_diff/staggered.py:315, diff_diff/staggered.py:567, diff_diff/staggered.py:1137, diff_diff/staggered.py:1529. The source paper explicitly frames this estimator around group-time ATT(g,t) objects with OR/IPW/DR estimation, so cache reuse is only valid when the comparison design is unchanged.
    Concrete fix: include g in cho_key whenever control membership depends on cohort (at minimum all control_group="not_yet_treated" cases), or disable DR Cholesky caching for that mode; add a regression test comparing cached vs uncached DR outputs for control_group="not_yet_treated".

  • Severity: P1 | Impact: Rank-deficient handling no longer matches the documented methodology/default behavior. The registry and public docstring say rank_deficient_action="warn" should detect collinearity via pivoted QR and return R-style dropped-column / NA behavior, but _linear_regression() now sets skip_rank_check=True for every non-"error" case, and the new covariate-reg fast path bypasses solve_ols entirely via Cholesky / raw lstsq. That silently changes estimator behavior for collinear or near-collinear covariates instead of honoring the documented contract. docs/methodology/REGISTRY.md:355, diff_diff/staggered.py:210, diff_diff/staggered.py:138, diff_diff/staggered.py:933, diff_diff/linalg.py:384.
    Concrete fix: preserve QR/SVD rank detection for warn and silent, and only take the Cholesky fast path after a verified full-rank check at the same 1e-07 tolerance; otherwise fall back to solve_ols with the requested rank_deficient_action.
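The two concrete fixes above can be sketched together. `cho_key` and `can_use_cholesky_cache` are hypothetical helper names; the point is that the key must carry the cohort when controls depend on it, and the fast path must pass a pivoted-QR rank check at the shared tolerance first:

```python
import numpy as np
from scipy.linalg import qr

def cho_key(g, t, base, control_group):
    """Cache key for a control design's Cholesky factor (sketch).
    Under not_yet_treated, controls exclude cohort g, so g must be
    part of the key; under never_treated it can be omitted."""
    return (base, t) if control_group == "never_treated" else (g, base, t)

def can_use_cholesky_cache(X_ctrl, tol=1e-7):
    """Gate the fast path on a pivoted-QR full-rank check at the same
    1e-07 tolerance as the shared OLS solver (sketch)."""
    _, R, _ = qr(X_ctrl, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    return bool(diag[-1] > tol * diag[0])
```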

Code Quality

  • Severity: P2 | Impact: _compute_all_att_gt_covariate_reg() silently drops problematic (g,t) cells with continue when beta, predictions, or residuals become non-finite. That mutates the set of estimated cells and therefore downstream aggregation weights instead of surfacing an explicit warning / NaN result. diff_diff/staggered.py:951, diff_diff/staggered.py:964, diff_diff/staggered.py:1180.
    Concrete fix: preserve the cell with effect=np.nan, se=np.nan, and NaN inference fields (plus a warning), or fall back to the scalar path rather than silently removing it.
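A sketch of the preserve-as-NaN behavior the fix asks for; `finalize_cell` is an illustrative helper, not a name from the diff:

```python
import warnings
import numpy as np

def finalize_cell(g, t, effect, se):
    """Keep a non-finite (g,t) cell as NaN instead of dropping it, so the
    set of estimated cells -- and hence aggregation weights -- is unchanged.
    Illustrative sketch of the suggested fix."""
    if not (np.isfinite(effect) and np.isfinite(se)):
        warnings.warn(
            f"ATT({g},{t}) regression produced non-finite results; "
            "recording NaN instead of dropping the cell"
        )
        return {"group": g, "time": t, "effect": np.nan, "se": np.nan}
    return {"group": g, "time": t, "effect": float(effect), "se": float(se)}
```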

Performance

  • Severity: P2 | Impact: The speedup currently relies on unchecked invariants rather than proving them. The bad DR cache key and the skipped rank checks mean part of the benchmark win comes from dropping correctness safeguards, so the performance numbers are not decision-useful yet for all supported modes. diff_diff/staggered.py:138, diff_diff/staggered.py:567, diff_diff/staggered.py:933.
    Concrete fix: keep the optimization, but gate cache reuse on explicit design-equivalence checks and preserve the old solver semantics before advertising the speed path as methodology-preserving.

Maintainability

  • Severity: P2 | Impact: The new _compute_all_att_gt_covariate_reg() duplicates large parts of _outcome_regression() / _linear_regression() instead of sharing a single cached-solver abstraction, and it has already drifted on rank handling and failure behavior. diff_diff/staggered.py:785, diff_diff/staggered.py:1324.
    Concrete fix: factor a shared OLS helper that accepts an optional cached factorization but still centralizes rank checks, fallbacks, warnings, and influence-function construction.

Tech Debt

  • Severity: P2 | Impact: The migration from unit IDs to index arrays is incomplete. Producers now emit treated_idx / control_idx, but fallback branches still reference treated_units / control_units, leaving two incompatible internal schemas. diff_diff/staggered.py:598, diff_diff/staggered_aggregation.py:210, diff_diff/staggered_bootstrap.py:180.
    Concrete fix: finish the migration in the fallback branches, or explicitly require precomputed in those paths so dead compatibility code cannot silently rot.

Security

  • No material findings.

Documentation/Tests

  • Severity: P2 | Impact: The added validation is benchmark-oriented, not methodology-oriented. It only covers five scenarios, all with the default control-group path; it never exercises control_group="not_yet_treated" or rank-deficient covariates, and it does not compare overall_p_value, overall_ci, event-study outputs, or group aggregations. The 0.01 SE tolerance is also loose enough to miss meaningful inference drift. benchmarks/speed_review/validate_results.py:99, benchmarks/speed_review/validate_results.py:144, benchmarks/speed_review/validate_results.py:169.
    Concrete fix: add real regression tests for not_yet_treated + DR + covariates, rank_deficient_action parity (warn / silent / error), and NaN/Inf bootstrap edge cases; extend snapshot checks to all inference fields and all aggregation objects.

Address PR #194 review feedback:
- Include cohort g in cho_key and control_key for not_yet_treated
  control group, preventing stale Cholesky/matrix reuse across groups
- Add UserWarning when group-time cells are dropped due to non-finite
  regression results in covariate reg path
- Add not_yet_treated validation scenarios (dr_2cov_nyt, reg_2cov_nyt)
- Add regression test for DR/reg + covariates + not_yet_treated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Mar 8, 2026

/ai-review


github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 01688aeb711828128409c4bb7f92177e90ea6142


Overall assessment: ⚠️ Needs changes

Executive summary

  • Affected methods: CallawaySantAnna ATT(g,t) estimation in the covariate-adjusted reg path, the DR first-stage outcome regression, and the influence-function bootstrap/aggregation plumbing.
  • I do not see a direct estimand-level conflict with Callaway-Sant’Anna or the official did::att_gt() options themselves; the problem is that the optimization changes documented edge-case behavior around those methods, especially rank-deficient covariate designs.
  • P1: the new default covariate path bypasses the library’s required pivoted-QR rank-deficiency handling, so collinear covariates can now silently change ATT(g,t) behavior instead of warning and dropping redundant columns.
  • P1: the migration from unit-id IF payloads to index arrays is incomplete; fallback consumers still read removed keys and can now fail if they are reached.
  • P2: the new validation coverage is too weak for the changed code paths; it does not check most of the new aggregation/bootstrap/inference outputs and misses the default rank_deficient_action="warn" mode.

Methodology
Affected methods are the covariate-adjusted Callaway-Sant’Anna paths (reg, dr, ipw) and the control_group / base_period variants exposed by did::att_gt(). The paper/docs support those options; the issue here is not the caching idea, but the implementation departing from the registry’s required handling for rank-deficient covariate designs.

  • Severity: P1. Impact: the optimized covariate reg branch selected in diff_diff/staggered.py#L1136-L1143 bypasses solve_ols() and uses raw Cholesky / scipy.linalg.lstsq in diff_diff/staggered.py#L934-L955, while the shared helper _linear_regression() now sets skip_rank_check=True for every non-"error" mode in diff_diff/staggered.py#L138-L147. That contradicts the documented contract in docs/methodology/REGISTRY.md#L355-L358 and diff_diff/linalg.py#L384-L389: default rank_deficient_action="warn" is supposed to detect collinearity via pivoted QR and apply R-style NA/warning handling, not silently return minimum-norm or numerically unstable coefficients. The fast path can also drop whole ATT(g,t) cells on non-finite regression output in diff_diff/staggered.py#L957-L1013, which changes downstream aggregation weights. Concrete fix: do not skip rank checks in warn/silent modes; only cache/reuse factorizations after a one-time full-rank confirmation for that control design, and otherwise fall back to the existing solve_ols() / _outcome_regression() path.

Code Quality

  • No additional findings beyond the methodology issue above.

Performance

  • No blocking performance findings. The vectorization/caching direction is reasonable once the rank-deficiency safeguards are restored.

Maintainability

  • No separate findings.

Tech Debt

  • No additional findings.

Security

  • No findings in the diff reviewed. I did not see secrets or obvious security-sensitive changes.

Documentation/Tests

  • Severity: P2. Impact: the new validation harness records overall_p_value, overall_ci, and event-study effects in benchmarks/speed_review/validate_results.py#L70-L95, but check_results() ignores them and only compares overall ATT/SE plus per-cell effect/SE, with a very loose 0.01 absolute tolerance for group-time SEs in benchmarks/speed_review/validate_results.py#L149-L175. Existing collinearity tests in tests/test_staggered.py#L928-L980 cover only "error" and "silent", not the default "warn" mode that this PR changes. Concrete fix: add explicit warn-mode regression tests for reg and dr with duplicated/near-collinear covariates, and extend validation to compare event-study/group aggregates plus bootstrap p-values/CIs/cband outputs with tighter tolerances.

Verification note: I could not execute the test suite in this sandbox because the available Python environment does not have numpy or pytest installed.

…gthen validation

- Revert _linear_regression skip_rank_check regression (always check rank)
- Add one-time rank check to batched covariate path with lstsq fallback
- Add dual-key IF payloads (treated_units/control_units + treated_idx/control_idx)
- Extend validate_results.py with p-value, CI, event-study, and tighter SE checks
- Add test_rank_deficient_action_warn_emits_warning test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Mar 8, 2026

/ai-review


github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 27bcc9f51ee0bc9485fd993f149f508249e25704


Overall assessment: ⚠️ Needs changes

Executive Summary

  • The PR changes the Callaway-Sant’Anna ATT(g,t) covariate-adjusted reg and dr paths, so the main review bar is parity with the registry’s documented estimator/inference contract.
  • The new batched reg path detects QR rank deficiency but then solves the full matrix with bare scipy.linalg.lstsq, so it no longer follows the documented R-style dropped-column behavior for near-collinear designs.
  • The new cached dr path tries Cholesky before any QR/tolerance check, so rank_deficient_action can be bypassed entirely when the design is numerically near-singular but still Cholesky-factorable.
  • The new batch bootstrap helper preserves NaN outputs, but the added validation/tests do not cover the new IPW not_yet_treated cache path, DR near-rank-deficiency, or degenerate per-effect bootstrap warning behavior.
  • Review is diff-based; I did not rerun the suite in this environment.

Methodology
Affected method: CallawaySantAnna ATT(g,t) with covariate-adjusted reg and dr, cross-checked against docs/methodology/REGISTRY.md:291 and the rank-deficiency contract at docs/methodology/REGISTRY.md:355.

  • Severity: P1. Impact: the optimized reg path warns on QR-detected rank deficiency, but then falls back to full-matrix scipy_linalg.lstsq rather than the shared QR-reduced solve. That is a methodological mismatch with the registry / shared linear algebra contract, which says rank-deficient designs are handled by dropping dependent columns at tolerance 1e-07 in R-style fashion. Near-collinear covariates can therefore change ATT(g,t) and SEs relative to the documented implementation, even though the warning implies the columns were dropped. Concrete fix: when _detect_rank_deficiency() flags dropped columns, solve the reduced kept_cols system (or delegate to solve_ols / _linear_regression with precomputed rank info) instead of calling bare lstsq on the full design. References: diff_diff/staggered.py:869, diff_diff/staggered.py:972, diff_diff/linalg.py:655.

  • Severity: P1. Impact: the new cached dr outcome-regression path attempts Cholesky before any QR-based rank check, and only falls back to _linear_regression() if Cholesky fails numerically. For designs that are rank-deficient under the library’s documented 1e-07 tolerance but still positive-definite enough for Cholesky, rank_deficient_action is never honored, warnings are skipped, and the optimized DR estimator silently deviates from the registry contract. Concrete fix: gate the cached Cholesky path behind the same rank-deficiency detection used by the shared OLS path, and only cache/use Cholesky when that check passes. References: diff_diff/staggered.py:1566, diff_diff/staggered.py:1586, docs/methodology/REGISTRY.md:355.
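The reduced-column solve both findings point toward can be sketched as follows; `solve_dropping_collinear` is an illustrative name, and the R-style behavior (drop dependent columns at tolerance, zero-fill their coefficients) is the documented contract being restored, not new methodology:

```python
import numpy as np
from scipy.linalg import qr, lstsq

def solve_dropping_collinear(X, y, tol=1e-7):
    """Pivoted-QR rank detection, then solve only the kept columns and
    zero-fill coefficients for dropped (linearly dependent) columns,
    mimicking R-style dropped-column handling. Illustrative sketch."""
    _, R, piv = qr(X, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    rank = int((diag > tol * diag[0]).sum())
    kept = np.sort(piv[:rank])          # columns surviving the rank check
    beta = np.zeros(X.shape[1])
    beta[kept] = lstsq(X[:, kept], y)[0]  # solve the reduced system only
    return beta, kept
```

Solving the reduced system (rather than bare full-matrix lstsq) keeps the warning and the actual fit consistent: the dropped column really contributes nothing.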

Code Quality
No separate findings beyond the solver-path drift already called out under Methodology.

Performance
No findings. The caching/vectorization direction looks reasonable once correctness is brought back in line with the shared solver behavior.

Maintainability
No separate findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • Severity: P2. Impact: the new high-risk paths are still under-tested. The added validation matrix covers dr_2cov_nyt and reg_2cov_nyt, but not the new IPW not_yet_treated cache path, and the new rank-deficiency tests only exercise exact duplicate columns on the reg fast path. There is also no test around the batched bootstrap helper’s degenerate / <50% valid per-effect warning behavior. Concrete fix: add ipw + covariates + not_yet_treated, DR rank_deficient_action in {"warn","silent"} with near-collinear (not exact-duplicate) covariates, and per-effect bootstrap tests asserting both NaN outputs and warning emission for invalid distributions. References: benchmarks/speed_review/validate_results.py:100, tests/test_staggered.py:888, tests/test_staggered.py:983, diff_diff/bootstrap_utils.py:327.

…e-07

Fix batched reg path: compute kept_cols on rank deficiency, skip per-pair
Cholesky when rank-deficient, solve reduced system with zero-fill for
dropped columns. Fix DR path: check rank before caching Cholesky factor,
use False sentinel for rank-deficient keys, zero NaN coefficients for
prediction. Add cond=1e-07 to all lstsq calls for solve_ols consistency.

Add ipw_2cov_nyt validation scenario, DR rank test, reg nyt rank test,
and bootstrap single-unit cohort edge case test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Mar 8, 2026

/ai-review


github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 2cf0973f48a2814c2d6808fac0e529f62daecc49


Overall assessment: ⚠️ Needs changes

Executive Summary

  • Affected method: CallawaySantAnna ATT(g,t) estimation on the covariate regression path, plus the multiplier-bootstrap inference helpers. The cache keys themselves are consistent with the registry’s base-period, anticipation, and control-group rules.
  • P1: the optimized reg + covariates path no longer preserves the documented per-cell rank-deficiency / invalid-cell behavior; on changing control samples it can silently solve under a different rule or omit ATT(g,t) cells entirely, which changes downstream aggregation weights.
  • P1: fit(..., covariates=[]) now routes into the covariate fast path and will dereference covariate_by_period=None instead of behaving like the no-covariate estimator.
  • P2: the new batched bootstrap stats helper drops the registry-mandated warning path when per-effect bootstrap validity falls below 50%.
  • P2: tests and the new validator do not cover the new estimation_method="reg" + covariates + n_bootstrap>0 path or several changed inference outputs.
  • Static review only: the sandbox Python here does not have numpy / pandas / scipy, so I could not execute the suite.

Methodology
Cross-check target: Callaway & Sant’Anna (2021) ATT(g,t) plus the registry’s edge-case requirements in docs/methodology/REGISTRY.md:L293-L364.

  • Severity P1. Impact: the new batched reg + covariates path checks rank only on the cached base control design, then later (g,t)-specific control designs can go through raw Cholesky / scipy_linalg.lstsq, and non-finite solves are dropped with continue. On unbalanced panels or control_group="not_yet_treated", that means the default rank_deficient_action="warn" / R-style NA behavior is no longer guaranteed per ATT(g,t), and failed cells can disappear from simple/event-study/group aggregation instead of surfacing as NaN. That is a methodological deviation from the documented ATT(g,t) handling and from the scalar regression path. Concrete fix: only reuse Cholesky for designs proven identical and full-rank; for every pair-specific pair_X_ctrl, rerun _detect_rank_deficiency() or call _linear_regression / solve_ols, and if the solve is invalid keep the (g,t) cell with NaN inference instead of omitting it. diff_diff/staggered.py:L869-L1057, diff_diff/staggered.py:L1372-L1437, docs/methodology/REGISTRY.md:L347-L358
  • Severity P2. Impact: compute_effect_bootstrap_stats_batch() returns all-NaN outputs when an effect fails the <50% valid bootstrap draws rule, but unlike the scalar helper it emits no warning/context. Since staggered_bootstrap.py now routes group-time, event-study, and group effects through this helper, invalid bootstrap inference on the changed paths can fail silently, which diverges from the registry’s documented “warn + NaN” behavior. Concrete fix: preserve the scalar helper’s warning semantics in batch mode, at least by warning once with the affected indices/labels before returning NaNs. diff_diff/bootstrap_utils.py:L317-L330, diff_diff/staggered_bootstrap.py:L339-L396, docs/methodology/REGISTRY.md:L359-L364
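The "warn + NaN" batch semantics asked for in the second bullet can be sketched as below; `batch_bootstrap_se` is an illustrative stand-in for the library's batch helper, not its real signature:

```python
import warnings
import numpy as np

def batch_bootstrap_se(draws, min_valid_frac=0.5):
    """Per-effect bootstrap SEs from a (n_draws, n_effects) matrix.
    Effects with fewer than min_valid_frac finite draws get NaN, and one
    summary warning names them, sketching the documented 'warn + NaN'
    rule rather than failing silently."""
    valid_frac = np.isfinite(draws).mean(axis=0)
    se = np.nanstd(draws, axis=0, ddof=1)
    bad = np.where(valid_frac < min_valid_frac)[0]
    se[bad] = np.nan
    if bad.size:
        warnings.warn(
            f"bootstrap inference invalid for effect indices {bad.tolist()}; "
            "returning NaN for those effects"
        )
    return se
```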

Code Quality

  • Severity P1. Impact: fit() now dispatches to _compute_all_att_gt_covariate_reg() whenever covariates is not None, but _precompute_structures() only builds covariate_by_period when covariates is truthy. A caller passing covariates=[] will therefore hit the covariate fast path with covariate_by_period=None and crash, whereas the pre-PR scalar path treated an empty list like “no covariates”. Concrete fix: normalize empty iterables to None before dispatch, or change the fast-path guard to if covariates:. diff_diff/staggered.py:L427-L435, diff_diff/staggered.py:L780-L833, diff_diff/staggered.py:L1173-L1187
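The normalization fix is small enough to sketch directly; `normalize_covariates` is a hypothetical helper illustrating the suggested guard:

```python
def normalize_covariates(covariates):
    """Treat empty iterables like None so covariates=[] takes the
    no-covariate path instead of the covariate fast path (sketch of
    the suggested pre-dispatch normalization)."""
    if covariates is None or len(covariates) == 0:
        return None
    return list(covariates)
```

Applied before dispatch, `covariates=[]` then behaves exactly like the pre-PR scalar path did.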

Performance

  • No new blocking performance finding. The caching/factorization approach is reasonable once the behavior regressions above are fixed.

Maintainability

  • No separate maintainability finding beyond the scalar/batch semantic drift already noted above.

Tech Debt

  • No new tech-debt finding beyond the validation gap below.

Security

  • No security findings in the changed library code or the added benchmark artifacts.

Documentation/Tests

  • Severity P2. Impact: the added tests cover reg / dr not-yet-treated and rank-deficiency only with n_bootstrap=0, while the only bootstrap-with-covariates test still exercises the default DR path. The new validator also serializes only effect/SE outputs, not the changed t_stat / p_value / CI fields or group aggregation results. That leaves the optimized estimation_method="reg" path and the new batched inference helpers largely unguarded. Concrete fix: add targeted tests for estimation_method="reg", covariates, n_bootstrap>0 under both control groups and an unbalanced panel, plus direct scalar-vs-batch equivalence tests for safe_inference_batch() / compute_effect_bootstrap_stats_batch(), and extend the validator to compare group aggregation plus per-effect inference fields. tests/test_staggered.py:L888-L1090, tests/test_staggered.py:L1365-L1458, benchmarks/speed_review/validate_results.py:L52-L97, benchmarks/speed_review/validate_results.py:L165-L196

igerber added a commit that referenced this pull request Mar 14, 2026
Address five failure modes observed after gpt-5.4 upgrade (PRs #192, #194,
#195): documented deviations flagged as blockers, deferred work not accepted,
moving goalposts on re-review, undefined approval criteria, and valid
implementation choices treated as methodology errors.

Changes to pr_review.md:
- Exempt REGISTRY.md-documented deviations from P0/P1 (classify as P3)
- Add implementation choice exception for valid numerical approaches
- Add Deferred Work Acceptance section honoring TODO.md tracking
- Add Assessment Criteria with objective verdicts and mitigation rules
- Add Re-review Scope rules to prevent oscillation between rounds

Changes to ai_pr_review.yml:
- Add step to fetch previous AI review comment for re-review context
- Inject prior review findings into compiled prompt on /ai-review reruns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 14, 2026

/ai-review


🔁 AI review rerun (requested by @igerber)

Head SHA: 2cf0973f48a2814c2d6808fac0e529f62daecc49


Overall assessment

⚠️ Needs changes

Executive summary

  • Affected methods: Callaway & Sant’Anna ATT(g,t) estimation on the optimized estimation_method="reg" covariate path, plus the multiplier-bootstrap inference helpers. I cross-checked against docs/methodology/REGISTRY.md:L293-L364 and the changed estimator docstrings.
  • The Cholesky/logit caching itself is a valid numerical optimization, and the earlier not_yet_treated cache-key concern looks addressed by the new keying and regression tests.
  • P1: fit(..., covariates=[]) still routes into the covariate fast path and dereferences covariate_by_period=None, so the PR still introduces a reproducible crash on an empty-list input.
  • P1: the optimized covariate-regression path still drops numerically invalid ATT(g,t) cells entirely instead of preserving them as invalid cells, which changes downstream aggregation weights and conflicts with the registry’s missing-cell rule.
  • P2: the new batch bootstrap stats helper still suppresses the scalar warning path for invalid per-effect bootstrap inference, so ATT(g,t)/event-study/group bootstrap outputs can now return all-NaN silently.
  • Static review only: this environment does not have numpy, pandas, scipy, or pytest, so I could not execute the suite.

Methodology

  • Severity P1. Impact: the previous control-design/cache-key issue appears fixed, but _compute_all_att_gt_covariate_reg() still increments n_dropped_cells and continues when the per-pair regression solve, prediction, or residuals become non-finite. That removes the (g,t) cell from group_time_effects, so simple/event-study/group aggregation reweights over a smaller set instead of preserving the invalid cell. That conflicts with the registry rule that missing group-time cells should be represented as invalid (NaN), not silently omitted. Concrete fix: on those branches, fall back to the scalar pairwise path for that (g,t); if it is still invalid, materialize the cell with NaN inference instead of omitting it. References: diff_diff/staggered.py:L980-L1057, docs/methodology/REGISTRY.md:L347-L350
  • Severity P2. Impact: compute_effect_bootstrap_stats_batch() returns all-NaN outputs for <50% valid draws or zero/non-finite bootstrap SE, but it does not emit the warning path required by the registry. Since ATT(g,t), event-study, and group bootstrap stats now use that helper, these changed paths can fail silently while overall ATT still warns. Concrete fix: add batch warning support (for example via an optional contexts list) or route invalid subsets through compute_effect_bootstrap_stats(). References: diff_diff/bootstrap_utils.py:L283-L377, diff_diff/staggered_bootstrap.py:L339-L396, docs/methodology/REGISTRY.md:L359-L364

Code Quality

  • Severity P1. Impact: fit() dispatches to the optimized covariate regression branch when covariates is not None, but _precompute_structures() only builds covariate_by_period when covariates is truthy. Passing covariates=[] therefore reaches _compute_all_att_gt_covariate_reg() with covariate_by_period=None and crashes. Concrete fix: normalize empty iterables to None before preprocessing/dispatch, or change the optimized branch guard to if covariates and self.estimation_method == "reg". References: diff_diff/staggered.py:L427-L435, diff_diff/staggered.py:L754-L833, diff_diff/staggered.py:L1173-L1187

Performance

  • No findings. The Cholesky/logistic caching is a reasonable implementation choice once the correctness issues above are fixed.

Maintainability

  • No separate findings beyond the scalar/batch semantic drift already noted in Methodology.

Tech Debt

  • No separate findings. The remaining lower-severity coverage gap below is not tracked in TODO.md.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new speed-review validator materially improves coverage and does exercise reg/dr/ipw with bootstrap, including reg + covariates, but it still serializes only effect/SE fields for group-time and event-study outputs and omits group_effects plus per-effect t_stat / p_value / CI, which are also changed by this PR’s batched inference helpers. Concrete fix: extend the validator to compare those outputs and add one direct scalar-vs-batch regression test for compute_effect_bootstrap_stats_batch(), especially the warning/NaN edge cases. References: benchmarks/speed_review/validate_results.py:L52-L97, benchmarks/speed_review/validate_results.py:L165-L196

Path to Approval

  1. Normalize empty covariates inputs before fast-path dispatch so covariates=[] behaves like the no-covariate estimator instead of crashing.
  2. In _compute_all_att_gt_covariate_reg(), stop dropping numerically invalid (g,t) cells via continue; fall back to the scalar pairwise path or preserve the cell as invalid so aggregations do not silently reweight around it.

…es, batch warnings, validator fields

- Normalize covariates=[] to None to prevent crash in covariate dispatch
- Preserve invalid regression cells as NaN instead of dropping (REGISTRY.md compliance)
- Filter NaN effects from aggregation in simple, event study, and group methods
- Add summary warnings to batch bootstrap stats for insufficient samples and zero SE
- Extend validator field coverage to include t_stat, p_value, conf_int, group_effects
- Re-save baselines with extended schema

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Mar 14, 2026

/ai-review


🔁 AI review rerun (requested by @igerber)

Head SHA: 4d54cd78760c803b9a3b714a02acfd72d96d642d


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Affected methods: Callaway-Sant’Anna ATT(g,t) estimation on the optimized estimation_method="reg" covariate path, plus multiplier-bootstrap inference for overall ATT, event-study, and group aggregations.
  • The prior re-review blockers appear resolved: covariates=[] is normalized to None, numerically invalid ATT(g,t) cells are now preserved instead of dropped, and the new batch bootstrap helper emits warnings for insufficient valid draws / zero SE cases.
  • Severity P1 [Newly identified]: once the optimized regression path preserves an invalid ATT(g,t) cell as NaN, the bootstrap aggregation code still builds overall/event-study/group bootstrap targets from the unfiltered cell set. That contaminates aggregate bootstrap SE/CI/p-values even though the analytical aggregators now correctly exclude those NaN cells.
  • I did not find a methodology defect in the Cholesky/logit caching itself. The not_yet_treated cache-keying concern from the previous review looks addressed.
  • Static review only: I could not execute pytest or a targeted repro in this environment because pytest is not installed and numpy is unavailable.

Methodology

Affected methods: Callaway-Sant’Anna ATT(g,t) with optimized covariate regression, and bootstrap aggregation of ATT(g,t) into overall/event-study/group effects. Cross-checking against the registry, the caching is a valid numerical optimization; the remaining issue is the bootstrap treatment of preserved NaN cells.

Code Quality

No separate findings. The earlier covariates=[] crash appears fixed at diff_diff/staggered.py:L1129.

Performance

No findings. The Cholesky/logistic-coefficient caching is a reasonable implementation choice and does not look like a methodology defect on its own.

Maintainability

No separate findings beyond the bootstrap/analytical aggregation drift above.

Tech Debt

No separate findings. TODO.md does not contain an entry that mitigates the P1 above, so it remains approval-blocking. See TODO.md:L38.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new coverage does not exercise the remaining failing interaction. The added preserved-NaN regression test stops at n_bootstrap=0, and the speed validator still omits group-level t_stat / p_value / CI fields, so this PR’s changed aggregate-bootstrap paths can still regress without detection. Concrete fix: add one bootstrap regression test that reuses the poisoned-lstsq setup with n_bootstrap>0 and aggregate="all" and asserts finite aggregates keep finite bootstrap inference after excluding invalid cells; extend the validator to serialize and compare group-effect inference fields as well. References: tests/test_staggered.py:L1049, benchmarks/speed_review/validate_results.py:L104, benchmarks/speed_review/validate_results.py:L229

Path to Approval

  1. Filter non-finite ATT(g,t) cells out of the bootstrap overall-ATT aggregation before computing overall_weights_post, original_overall, and the combined IF, mirroring diff_diff/staggered_aggregation.py:L90.
  2. Apply the same finite-cell filtering and weight renormalization inside _prepare_event_study_aggregation() and _prepare_group_aggregation() so bootstrap event-study/group inference is based on the same cell set as the reported analytical aggregates.
  3. Add a regression test that forces one optimized covariate-regression ATT(g,t) cell to NaN under n_bootstrap>0, aggregate="all" and asserts overall/event-study/group bootstrap SE/CI/p-values are computed from the remaining finite cells rather than all becoming NaN.
