
Optimize CallawaySantAnna covariate path (5-19x speedup)#194

Open
igerber wants to merge 5 commits into main from speed-review

Conversation


igerber (Owner) commented Mar 8, 2026

Summary

  • Cache Cholesky factorizations of X^T X across (g,t) pairs sharing the same control design matrix, eliminating redundant OLS solves in the covariate regression path
  • Cache propensity score logistic regression coefficients for IPW/DR methods, avoiding redundant BFGS optimization
  • Add _compute_all_att_gt_covariate_reg() batched method for the reg+covariates path, mirroring the existing _compute_all_att_gt_vectorized() pattern
  • Extend benchmark and validation suites with 10-covariate, DR, and IPW scenarios
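The propensity-coefficient caching in the second bullet can be sketched as below. All names here are illustrative (not the library's API), and a plain Newton/IRLS fit stands in for the BFGS optimizer the PR actually avoids re-running:

```python
import numpy as np

def cached_logit_coefs(key, X, d, cache):
    """Reuse fitted propensity-score coefficients across (g,t) pairs that
    share the same design. Illustrative sketch: a Newton/IRLS logistic fit
    stands in for the library's BFGS optimizer."""
    if key in cache:
        return cache[key]  # skip the refit entirely on a cache hit
    beta = np.zeros(X.shape[1])
    for _ in range(25):  # Newton/IRLS iterations
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                    # IRLS weights
        grad = X.T @ (d - p)                 # score
        H = X.T @ (X * W[:, None])           # observed information
        beta = beta + np.linalg.solve(H, grad)
    cache[key] = beta
    return beta
```

A second call with the same key returns the cached array without touching the optimizer, which is where the IPW/DR speedup comes from.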

Speedups at 5K units:

| Scenario | Before | After | Speedup |
|---|---|---|---|
| 2 covariates, reg | 0.158s | 0.028s | 5.6x |
| 10 covariates, reg | 0.535s | 0.042s | 12.7x |
| 2 covariates, DR | 0.373s | 0.059s | 6.3x |
| 10 cov, reg, 50K units | 4.430s | 0.230s | 19.3x |

Methodology references (required if estimator / math changes)

  • Method name(s): Callaway & Sant'Anna (2021) ATT(g,t) estimation
  • Paper / source link(s): Callaway, B., & Sant'Anna, P. H. (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.
  • Any intentional deviations from the source (and why): None — Cholesky factorization caching is a numerical optimization that produces identical results to lstsq (validated to 1e-10 tolerance). Falls back to lstsq on singular/ill-conditioned matrices.
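A minimal sketch of the caching-with-fallback behavior described above; `solve_ols_cached` and the cache layout are hypothetical names, not taken from the diff:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, LinAlgError

def solve_ols_cached(X, y, cache, key):
    """Solve min ||X b - y|| reusing a cached Cholesky factor of X^T X
    across calls that share the same design (same cache key).

    Falls back to lstsq when X^T X is not positive definite, mirroring
    the singular/ill-conditioned fallback described in the PR."""
    if key not in cache:
        try:
            cache[key] = cho_factor(X.T @ X)
        except LinAlgError:
            cache[key] = None  # sentinel: use lstsq for this design
    factor = cache[key]
    if factor is None:
        return np.linalg.lstsq(X, y, rcond=None)[0]
    return cho_solve(factor, X.T @ y)
```

On a well-conditioned design this agrees with lstsq to floating-point precision, which is the equivalence the 1e-10 validation tolerance checks.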

Validation

  • Tests added/updated: benchmarks/speed_review/bench_callaway.py, benchmarks/speed_review/validate_results.py
  • 96/96 staggered tests pass, 1489/1489 full suite tests pass
  • 5-scenario numerical validation (reg/DR/IPW × 0/2/10 covariates) matches baselines saved before optimization

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…hing

Cache Cholesky factorizations of X^T X across (g,t) pairs that share the
same control design matrix, eliminating redundant OLS solves. Cache
propensity score coefficients for IPW/DR methods. Achieves 5-19x speedup
on covariate regression path depending on k.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot commented Mar 8, 2026

Overall assessment: ⛔ Blocker

Executive summary

  • This PR affects CallawaySantAnna’s covariate-adjusted ATT(g,t) paths, especially estimation_method="dr" and estimation_method="reg".
  • P0: the new DR Cholesky cache key is too coarse for control_group="not_yet_treated", so cached outcome regressions can be reused across different cohort-specific control sets.
  • P1: rank-deficient covariate handling no longer matches the Methodology Registry / public rank_deficient_action contract; the fast path skips QR rank checks and bypasses solve_ols.
  • The new benchmark validator would not catch either issue: it never exercises control_group="not_yet_treated" or rank-deficient covariates, and it barely checks inference outputs.
  • No direct security issues stand out.

Methodology

  • Severity: P0 | Impact: The DR cache breaks the documented not_yet_treated control definition. cho_key is set to (base_period_val, t) for all non-never_treated cases, but under control_group="not_yet_treated" the control design also depends on cohort g because cohort g must be excluded from controls for each ATT(g,t). That means the cached Cholesky factor can be reused across different cohorts that share the same (base, t) under the default base_period="varying", giving the wrong outcome-regression fit for at least the lead / pre-treatment comparisons. This violates the registry’s control-mask rule and the paper’s per-ATT(g,t) group-time setup. docs/methodology/REGISTRY.md:395, diff_diff/staggered.py:315, diff_diff/staggered.py:567, diff_diff/staggered.py:1137, diff_diff/staggered.py:1529. The source paper explicitly frames this estimator around group-time ATT(g,t) objects with OR/IPW/DR estimation, so cache reuse is only valid when the comparison design is unchanged.
    Concrete fix: include g in cho_key whenever control membership depends on cohort (at minimum all control_group="not_yet_treated" cases), or disable DR Cholesky caching for that mode; add a regression test comparing cached vs uncached DR outputs for control_group="not_yet_treated".

  • Severity: P1 | Impact: Rank-deficient handling no longer matches the documented methodology/default behavior. The registry and public docstring say rank_deficient_action="warn" should detect collinearity via pivoted QR and return R-style dropped-column / NA behavior, but _linear_regression() now sets skip_rank_check=True for every non-"error" case, and the new covariate-reg fast path bypasses solve_ols entirely via Cholesky / raw lstsq. That silently changes estimator behavior for collinear or near-collinear covariates instead of honoring the documented contract. docs/methodology/REGISTRY.md:355, diff_diff/staggered.py:210, diff_diff/staggered.py:138, diff_diff/staggered.py:933, diff_diff/linalg.py:384.
    Concrete fix: preserve QR/SVD rank detection for warn and silent, and only take the Cholesky fast path after a verified full-rank check at the same 1e-07 tolerance; otherwise fall back to solve_ols with the requested rank_deficient_action.
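The two concrete fixes above can be sketched together. `cho_key` and `can_use_cholesky_cache` are hypothetical helper names; the point is that the key must carry the cohort when controls depend on it, and the fast path must pass a pivoted-QR rank check at the shared tolerance first:

```python
import numpy as np
from scipy.linalg import qr

def cho_key(g, t, base, control_group):
    """Cache key for a control design's Cholesky factor (sketch).
    Under not_yet_treated, controls exclude cohort g, so g must be
    part of the key; under never_treated it can be omitted."""
    return (base, t) if control_group == "never_treated" else (g, base, t)

def can_use_cholesky_cache(X_ctrl, tol=1e-7):
    """Gate the fast path on a pivoted-QR full-rank check at the same
    1e-07 tolerance as the shared OLS solver (sketch)."""
    _, R, _ = qr(X_ctrl, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    return bool(diag[-1] > tol * diag[0])
```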

Code Quality

  • Severity: P2 | Impact: _compute_all_att_gt_covariate_reg() silently drops problematic (g,t) cells with continue when beta, predictions, or residuals become non-finite. That mutates the set of estimated cells and therefore downstream aggregation weights instead of surfacing an explicit warning / NaN result. diff_diff/staggered.py:951, diff_diff/staggered.py:964, diff_diff/staggered.py:1180.
    Concrete fix: preserve the cell with effect=np.nan, se=np.nan, and NaN inference fields (plus a warning), or fall back to the scalar path rather than silently removing it.
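A sketch of the preserve-as-NaN behavior the fix asks for; `finalize_cell` is an illustrative helper, not a name from the diff:

```python
import warnings
import numpy as np

def finalize_cell(g, t, effect, se):
    """Keep a non-finite (g,t) cell as NaN instead of dropping it, so the
    set of estimated cells -- and hence aggregation weights -- is unchanged.
    Illustrative sketch of the suggested fix."""
    if not (np.isfinite(effect) and np.isfinite(se)):
        warnings.warn(
            f"ATT({g},{t}) regression produced non-finite results; "
            "recording NaN instead of dropping the cell"
        )
        return {"group": g, "time": t, "effect": np.nan, "se": np.nan}
    return {"group": g, "time": t, "effect": float(effect), "se": float(se)}
```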

Performance

  • Severity: P2 | Impact: The speedup currently relies on unchecked invariants rather than proving them. The bad DR cache key and the skipped rank checks mean part of the benchmark win comes from dropping correctness safeguards, so the performance numbers are not decision-useful yet for all supported modes. diff_diff/staggered.py:138, diff_diff/staggered.py:567, diff_diff/staggered.py:933.
    Concrete fix: keep the optimization, but gate cache reuse on explicit design-equivalence checks and preserve the old solver semantics before advertising the speed path as methodology-preserving.

Maintainability

  • Severity: P2 | Impact: The new _compute_all_att_gt_covariate_reg() duplicates large parts of _outcome_regression() / _linear_regression() instead of sharing a single cached-solver abstraction, and it has already drifted on rank handling and failure behavior. diff_diff/staggered.py:785, diff_diff/staggered.py:1324.
    Concrete fix: factor a shared OLS helper that accepts an optional cached factorization but still centralizes rank checks, fallbacks, warnings, and influence-function construction.

Tech Debt

  • Severity: P2 | Impact: The migration from unit IDs to index arrays is incomplete. Producers now emit treated_idx / control_idx, but fallback branches still reference treated_units / control_units, leaving two incompatible internal schemas. diff_diff/staggered.py:598, diff_diff/staggered_aggregation.py:210, diff_diff/staggered_bootstrap.py:180.
    Concrete fix: finish the migration in the fallback branches, or explicitly require precomputed in those paths so dead compatibility code cannot silently rot.

Security

  • No material findings.

Documentation/Tests

  • Severity: P2 | Impact: The added validation is benchmark-oriented, not methodology-oriented. It only covers five scenarios, all with the default control-group path; it never exercises control_group="not_yet_treated" or rank-deficient covariates, and it does not compare overall_p_value, overall_ci, event-study outputs, or group aggregations. The 0.01 SE tolerance is also loose enough to miss meaningful inference drift. benchmarks/speed_review/validate_results.py:99, benchmarks/speed_review/validate_results.py:144, benchmarks/speed_review/validate_results.py:169.
    Concrete fix: add real regression tests for not_yet_treated + DR + covariates, rank_deficient_action parity (warn / silent / error), and NaN/Inf bootstrap edge cases; extend snapshot checks to all inference fields and all aggregation objects.

Address PR #194 review feedback:
- Include cohort g in cho_key and control_key for not_yet_treated
  control group, preventing stale Cholesky/matrix reuse across groups
- Add UserWarning when group-time cells are dropped due to non-finite
  regression results in covariate reg path
- Add not_yet_treated validation scenarios (dr_2cov_nyt, reg_2cov_nyt)
- Add regression test for DR/reg + covariates + not_yet_treated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Mar 8, 2026

/ai-review


github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 01688aeb711828128409c4bb7f92177e90ea6142


Overall assessment: ⚠️ Needs changes

Executive summary

  • Affected methods: CallawaySantAnna ATT(g,t) estimation in the covariate-adjusted reg path, the DR first-stage outcome regression, and the influence-function bootstrap/aggregation plumbing.
  • I do not see a direct estimand-level conflict with Callaway-Sant’Anna or the official did::att_gt() options themselves; the problem is that the optimization changes documented edge-case behavior around those methods, especially rank-deficient covariate designs.
  • P1: the new default covariate path bypasses the library’s required pivoted-QR rank-deficiency handling, so collinear covariates can now silently change ATT(g,t) behavior instead of warning and dropping redundant columns.
  • P1: the migration from unit-id IF payloads to index arrays is incomplete; fallback consumers still read removed keys and can now fail if they are reached.
  • P2: the new validation coverage is too weak for the changed code paths; it does not check most of the new aggregation/bootstrap/inference outputs and misses the default rank_deficient_action="warn" mode.

Methodology
Affected methods are the covariate-adjusted Callaway-Sant’Anna paths (reg, dr, ipw) and the control_group / base_period variants exposed by did::att_gt(). The paper/docs support those options; the issue here is not the caching idea, but the implementation departing from the registry’s required handling for rank-deficient covariate designs.

  • Severity: P1. Impact: the optimized covariate reg branch selected in diff_diff/staggered.py#L1136-L1143 bypasses solve_ols() and uses raw Cholesky / scipy.linalg.lstsq in diff_diff/staggered.py#L934-L955, while the shared helper _linear_regression() now sets skip_rank_check=True for every non-"error" mode in diff_diff/staggered.py#L138-L147. That contradicts the documented contract in docs/methodology/REGISTRY.md#L355-L358 and diff_diff/linalg.py#L384-L389: default rank_deficient_action="warn" is supposed to detect collinearity via pivoted QR and apply R-style NA/warning handling, not silently return minimum-norm or numerically unstable coefficients. The fast path can also drop whole ATT(g,t) cells on non-finite regression output in diff_diff/staggered.py#L957-L1013, which changes downstream aggregation weights. Concrete fix: do not skip rank checks in warn/silent modes; only cache/reuse factorizations after a one-time full-rank confirmation for that control design, and otherwise fall back to the existing solve_ols() / _outcome_regression() path.

Code Quality

  • No additional findings beyond the methodology issue above.

Performance

  • No blocking performance findings. The vectorization/caching direction is reasonable once the rank-deficiency safeguards are restored.

Maintainability

  • No separate findings.

Tech Debt

  • No additional findings.

Security

  • No findings in the diff reviewed. I did not see secrets or obvious security-sensitive changes.

Documentation/Tests

  • Severity: P2. Impact: the new validation harness records overall_p_value, overall_ci, and event-study effects in benchmarks/speed_review/validate_results.py#L70-L95, but check_results() ignores them and only compares overall ATT/SE plus per-cell effect/SE, with a very loose 0.01 absolute tolerance for group-time SEs in benchmarks/speed_review/validate_results.py#L149-L175. Existing collinearity tests in tests/test_staggered.py#L928-L980 cover only "error" and "silent", not the default "warn" mode that this PR changes. Concrete fix: add explicit warn-mode regression tests for reg and dr with duplicated/near-collinear covariates, and extend validation to compare event-study/group aggregates plus bootstrap p-values/CIs/cband outputs with tighter tolerances.

Verification note: I could not execute the test suite in this sandbox because the available Python environment does not have numpy or pytest installed.

…gthen validation

- Revert _linear_regression skip_rank_check regression (always check rank)
- Add one-time rank check to batched covariate path with lstsq fallback
- Add dual-key IF payloads (treated_units/control_units + treated_idx/control_idx)
- Extend validate_results.py with p-value, CI, event-study, and tighter SE checks
- Add test_rank_deficient_action_warn_emits_warning test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Mar 8, 2026

/ai-review


github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 27bcc9f51ee0bc9485fd993f149f508249e25704


Overall assessment: ⚠️ Needs changes

Executive Summary

  • The PR changes the Callaway-Sant’Anna ATT(g,t) covariate-adjusted reg and dr paths, so the main review bar is parity with the registry’s documented estimator/inference contract.
  • The new batched reg path detects QR rank deficiency but then solves the full matrix with bare scipy.linalg.lstsq, so it no longer follows the documented R-style dropped-column behavior for near-collinear designs.
  • The new cached dr path tries Cholesky before any QR/tolerance check, so rank_deficient_action can be bypassed entirely when the design is numerically near-singular but still Cholesky-factorable.
  • The new batch bootstrap helper preserves NaN outputs, but the added validation/tests do not cover the new IPW not_yet_treated cache path, DR near-rank-deficiency, or degenerate per-effect bootstrap warning behavior.
  • Review is diff-based; I did not rerun the suite in this environment.

Methodology
Affected method: CallawaySantAnna ATT(g,t) with covariate-adjusted reg and dr, cross-checked against docs/methodology/REGISTRY.md:291 and the rank-deficiency contract at docs/methodology/REGISTRY.md:355.

  • Severity: P1. Impact: the optimized reg path warns on QR-detected rank deficiency, but then falls back to full-matrix scipy_linalg.lstsq rather than the shared QR-reduced solve. That is a methodological mismatch with the registry / shared linear algebra contract, which says rank-deficient designs are handled by dropping dependent columns at tolerance 1e-07 in R-style fashion. Near-collinear covariates can therefore change ATT(g,t) and SEs relative to the documented implementation, even though the warning implies the columns were dropped. Concrete fix: when _detect_rank_deficiency() flags dropped columns, solve the reduced kept_cols system (or delegate to solve_ols / _linear_regression with precomputed rank info) instead of calling bare lstsq on the full design. References: diff_diff/staggered.py:869, diff_diff/staggered.py:972, diff_diff/linalg.py:655.

  • Severity: P1. Impact: the new cached dr outcome-regression path attempts Cholesky before any QR-based rank check, and only falls back to _linear_regression() if Cholesky fails numerically. For designs that are rank-deficient under the library’s documented 1e-07 tolerance but still positive-definite enough for Cholesky, rank_deficient_action is never honored, warnings are skipped, and the optimized DR estimator silently deviates from the registry contract. Concrete fix: gate the cached Cholesky path behind the same rank-deficiency detection used by the shared OLS path, and only cache/use Cholesky when that check passes. References: diff_diff/staggered.py:1566, diff_diff/staggered.py:1586, docs/methodology/REGISTRY.md:355.
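The reduced-column solve both findings point toward can be sketched as follows; `solve_dropping_collinear` is an illustrative name, and the R-style behavior (drop dependent columns at tolerance, zero-fill their coefficients) is the documented contract being restored, not new methodology:

```python
import numpy as np
from scipy.linalg import qr, lstsq

def solve_dropping_collinear(X, y, tol=1e-7):
    """Pivoted-QR rank detection, then solve only the kept columns and
    zero-fill coefficients for dropped (linearly dependent) columns,
    mimicking R-style dropped-column handling. Illustrative sketch."""
    _, R, piv = qr(X, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    rank = int((diag > tol * diag[0]).sum())
    kept = np.sort(piv[:rank])          # columns surviving the rank check
    beta = np.zeros(X.shape[1])
    beta[kept] = lstsq(X[:, kept], y)[0]  # solve the reduced system only
    return beta, kept
```

Solving the reduced system (rather than bare full-matrix lstsq) keeps the warning and the actual fit consistent: the dropped column really contributes nothing.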

Code Quality
No separate findings beyond the solver-path drift already called out under Methodology.

Performance
No findings. The caching/vectorization direction looks reasonable once correctness is brought back in line with the shared solver behavior.

Maintainability
No separate findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • Severity: P2. Impact: the new high-risk paths are still under-tested. The added validation matrix covers dr_2cov_nyt and reg_2cov_nyt, but not the new IPW not_yet_treated cache path, and the new rank-deficiency tests only exercise exact duplicate columns on the reg fast path. There is also no test around the batched bootstrap helper’s degenerate / <50% valid per-effect warning behavior. Concrete fix: add ipw + covariates + not_yet_treated, DR rank_deficient_action in {"warn","silent"} with near-collinear (not exact-duplicate) covariates, and per-effect bootstrap tests asserting both NaN outputs and warning emission for invalid distributions. References: benchmarks/speed_review/validate_results.py:100, tests/test_staggered.py:888, tests/test_staggered.py:983, diff_diff/bootstrap_utils.py:327.

…e-07

Fix batched reg path: compute kept_cols on rank deficiency, skip per-pair
Cholesky when rank-deficient, solve reduced system with zero-fill for
dropped columns. Fix DR path: check rank before caching Cholesky factor,
use False sentinel for rank-deficient keys, zero NaN coefficients for
prediction. Add cond=1e-07 to all lstsq calls for solve_ols consistency.

Add ipw_2cov_nyt validation scenario, DR rank test, reg nyt rank test,
and bootstrap single-unit cohort edge case test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Mar 8, 2026

/ai-review


github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 2cf0973f48a2814c2d6808fac0e529f62daecc49


Overall assessment: ⚠️ Needs changes

Executive Summary

  • Affected method: CallawaySantAnna ATT(g,t) estimation on the covariate regression path, plus the multiplier-bootstrap inference helpers. The cache keys themselves are consistent with the registry’s base-period, anticipation, and control-group rules.
  • P1: the optimized reg + covariates path no longer preserves the documented per-cell rank-deficiency / invalid-cell behavior; on changing control samples it can silently solve under a different rule or omit ATT(g,t) cells entirely, which changes downstream aggregation weights.
  • P1: fit(..., covariates=[]) now routes into the covariate fast path and will dereference covariate_by_period=None instead of behaving like the no-covariate estimator.
  • P2: the new batched bootstrap stats helper drops the registry-mandated warning path when per-effect bootstrap validity falls below 50%.
  • P2: tests and the new validator do not cover the new estimation_method="reg" + covariates + n_bootstrap>0 path or several changed inference outputs.
  • Static review only: the sandbox Python here does not have numpy / pandas / scipy, so I could not execute the suite.

Methodology
Cross-check target: Callaway & Sant’Anna (2021) ATT(g,t) plus the registry’s edge-case requirements in docs/methodology/REGISTRY.md:L293-L364.

  • Severity P1. Impact: the new batched reg + covariates path checks rank only on the cached base control design, then later (g,t)-specific control designs can go through raw Cholesky / scipy_linalg.lstsq, and non-finite solves are dropped with continue. On unbalanced panels or control_group="not_yet_treated", that means the default rank_deficient_action="warn" / R-style NA behavior is no longer guaranteed per ATT(g,t), and failed cells can disappear from simple/event-study/group aggregation instead of surfacing as NaN. That is a methodological deviation from the documented ATT(g,t) handling and from the scalar regression path. Concrete fix: only reuse Cholesky for designs proven identical and full-rank; for every pair-specific pair_X_ctrl, rerun _detect_rank_deficiency() or call _linear_regression / solve_ols, and if the solve is invalid keep the (g,t) cell with NaN inference instead of omitting it. diff_diff/staggered.py:L869-L1057, diff_diff/staggered.py:L1372-L1437, docs/methodology/REGISTRY.md:L347-L358
  • Severity P2. Impact: compute_effect_bootstrap_stats_batch() returns all-NaN outputs when an effect fails the <50% valid bootstrap draws rule, but unlike the scalar helper it emits no warning/context. Since staggered_bootstrap.py now routes group-time, event-study, and group effects through this helper, invalid bootstrap inference on the changed paths can fail silently, which diverges from the registry’s documented “warn + NaN” behavior. Concrete fix: preserve the scalar helper’s warning semantics in batch mode, at least by warning once with the affected indices/labels before returning NaNs. diff_diff/bootstrap_utils.py:L317-L330, diff_diff/staggered_bootstrap.py:L339-L396, docs/methodology/REGISTRY.md:L359-L364
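The "warn + NaN" batch semantics asked for in the second bullet can be sketched as below; `batch_bootstrap_se` is an illustrative stand-in for the library's batch helper, not its real signature:

```python
import warnings
import numpy as np

def batch_bootstrap_se(draws, min_valid_frac=0.5):
    """Per-effect bootstrap SEs from a (n_draws, n_effects) matrix.
    Effects with fewer than min_valid_frac finite draws get NaN, and one
    summary warning names them, sketching the documented 'warn + NaN'
    rule rather than failing silently."""
    valid_frac = np.isfinite(draws).mean(axis=0)
    se = np.nanstd(draws, axis=0, ddof=1)
    bad = np.where(valid_frac < min_valid_frac)[0]
    se[bad] = np.nan
    if bad.size:
        warnings.warn(
            f"bootstrap inference invalid for effect indices {bad.tolist()}; "
            "returning NaN for those effects"
        )
    return se
```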

Code Quality

  • Severity P1. Impact: fit() now dispatches to _compute_all_att_gt_covariate_reg() whenever covariates is not None, but _precompute_structures() only builds covariate_by_period when covariates is truthy. A caller passing covariates=[] will therefore hit the covariate fast path with covariate_by_period=None and crash, whereas the pre-PR scalar path treated an empty list like “no covariates”. Concrete fix: normalize empty iterables to None before dispatch, or change the fast-path guard to if covariates:. diff_diff/staggered.py:L427-L435, diff_diff/staggered.py:L780-L833, diff_diff/staggered.py:L1173-L1187
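The normalization fix is small enough to sketch directly; `normalize_covariates` is a hypothetical helper illustrating the suggested guard:

```python
def normalize_covariates(covariates):
    """Treat empty iterables like None so covariates=[] takes the
    no-covariate path instead of the covariate fast path (sketch of
    the suggested pre-dispatch normalization)."""
    if covariates is None or len(covariates) == 0:
        return None
    return list(covariates)
```

Applied before dispatch, `covariates=[]` then behaves exactly like the pre-PR scalar path did.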

Performance

  • No new blocking performance finding. The caching/factorization approach is reasonable once the behavior regressions above are fixed.

Maintainability

  • No separate maintainability finding beyond the scalar/batch semantic drift already noted above.

Tech Debt

  • No new tech-debt finding beyond the validation gap below.

Security

  • No security findings in the changed library code or the added benchmark artifacts.

Documentation/Tests

  • Severity P2. Impact: the added tests cover reg / dr not-yet-treated and rank-deficiency only with n_bootstrap=0, while the only bootstrap-with-covariates test still exercises the default DR path. The new validator also serializes only effect/SE outputs, not the changed t_stat / p_value / CI fields or group aggregation results. That leaves the optimized estimation_method="reg" path and the new batched inference helpers largely unguarded. Concrete fix: add targeted tests for estimation_method="reg", covariates, n_bootstrap>0 under both control groups and an unbalanced panel, plus direct scalar-vs-batch equivalence tests for safe_inference_batch() / compute_effect_bootstrap_stats_batch(), and extend the validator to compare group aggregation plus per-effect inference fields. tests/test_staggered.py:L888-L1090, tests/test_staggered.py:L1365-L1458, benchmarks/speed_review/validate_results.py:L52-L97, benchmarks/speed_review/validate_results.py:L165-L196

igerber added a commit that referenced this pull request Mar 14, 2026
Address five failure modes observed after gpt-5.4 upgrade (PRs #192, #194,
#195): documented deviations flagged as blockers, deferred work not accepted,
moving goalposts on re-review, undefined approval criteria, and valid
implementation choices treated as methodology errors.

Changes to pr_review.md:
- Exempt REGISTRY.md-documented deviations from P0/P1 (classify as P3)
- Add implementation choice exception for valid numerical approaches
- Add Deferred Work Acceptance section honoring TODO.md tracking
- Add Assessment Criteria with objective verdicts and mitigation rules
- Add Re-review Scope rules to prevent oscillation between rounds

Changes to ai_pr_review.yml:
- Add step to fetch previous AI review comment for re-review context
- Inject prior review findings into compiled prompt on /ai-review reruns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 14, 2026

/ai-review


🔁 AI review rerun (requested by @igerber)

Head SHA: 2cf0973f48a2814c2d6808fac0e529f62daecc49


Overall assessment

⚠️ Needs changes

Executive summary

  • Affected methods: Callaway & Sant’Anna ATT(g,t) estimation on the optimized estimation_method="reg" covariate path, plus the multiplier-bootstrap inference helpers. I cross-checked against docs/methodology/REGISTRY.md:L293-L364 and the changed estimator docstrings.
  • The Cholesky/logit caching itself is a valid numerical optimization, and the earlier not_yet_treated cache-key concern looks addressed by the new keying and regression tests.
  • P1: fit(..., covariates=[]) still routes into the covariate fast path and dereferences covariate_by_period=None, so the PR still introduces a reproducible crash on an empty-list input.
  • P1: the optimized covariate-regression path still drops numerically invalid ATT(g,t) cells entirely instead of preserving them as invalid cells, which changes downstream aggregation weights and conflicts with the registry’s missing-cell rule.
  • P2: the new batch bootstrap stats helper still suppresses the scalar warning path for invalid per-effect bootstrap inference, so ATT(g,t)/event-study/group bootstrap outputs can now return all-NaN silently.
  • Static review only: this environment does not have numpy, pandas, scipy, or pytest, so I could not execute the suite.

Methodology

  • Severity P1. Impact: the previous control-design/cache-key issue appears fixed, but _compute_all_att_gt_covariate_reg() still increments n_dropped_cells and continues when the per-pair regression solve, prediction, or residuals become non-finite. That removes the (g,t) cell from group_time_effects, so simple/event-study/group aggregation reweights over a smaller set instead of preserving the invalid cell. That conflicts with the registry rule that missing group-time cells should be represented as invalid (NaN), not silently omitted. Concrete fix: on those branches, fall back to the scalar pairwise path for that (g,t); if it is still invalid, materialize the cell with NaN inference instead of omitting it. References: diff_diff/staggered.py:L980-L1057, docs/methodology/REGISTRY.md:L347-L350
  • Severity P2. Impact: compute_effect_bootstrap_stats_batch() returns all-NaN outputs for <50% valid draws or zero/non-finite bootstrap SE, but it does not emit the warning path required by the registry. Since ATT(g,t), event-study, and group bootstrap stats now use that helper, these changed paths can fail silently while overall ATT still warns. Concrete fix: add batch warning support (for example via an optional contexts list) or route invalid subsets through compute_effect_bootstrap_stats(). References: diff_diff/bootstrap_utils.py:L283-L377, diff_diff/staggered_bootstrap.py:L339-L396, docs/methodology/REGISTRY.md:L359-L364

Code Quality

  • Severity P1. Impact: fit() dispatches to the optimized covariate regression branch when covariates is not None, but _precompute_structures() only builds covariate_by_period when covariates is truthy. Passing covariates=[] therefore reaches _compute_all_att_gt_covariate_reg() with covariate_by_period=None and crashes. Concrete fix: normalize empty iterables to None before preprocessing/dispatch, or change the optimized branch guard to if covariates and self.estimation_method == "reg". References: diff_diff/staggered.py:L427-L435, diff_diff/staggered.py:L754-L833, diff_diff/staggered.py:L1173-L1187

Performance

  • No findings. The Cholesky/logistic caching is a reasonable implementation choice once the correctness issues above are fixed.

Maintainability

  • No separate findings beyond the scalar/batch semantic drift already noted in Methodology.

Tech Debt

  • No separate findings. The remaining lower-severity coverage gap below is not tracked in TODO.md.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new speed-review validator materially improves coverage and does exercise reg/dr/ipw with bootstrap, including reg + covariates, but it still serializes only effect/SE fields for group-time and event-study outputs and omits group_effects plus per-effect t_stat / p_value / CI, which are also changed by this PR’s batched inference helpers. Concrete fix: extend the validator to compare those outputs and add one direct scalar-vs-batch regression test for compute_effect_bootstrap_stats_batch(), especially the warning/NaN edge cases. References: benchmarks/speed_review/validate_results.py:L52-L97, benchmarks/speed_review/validate_results.py:L165-L196

Path to Approval

  1. Normalize empty covariates inputs before fast-path dispatch so covariates=[] behaves like the no-covariate estimator instead of crashing.
  2. In _compute_all_att_gt_covariate_reg(), stop dropping numerically invalid (g,t) cells via continue; fall back to the scalar pairwise path or preserve the cell as invalid so aggregations do not silently reweight around it.

…es, batch warnings, validator fields

- Normalize covariates=[] to None to prevent crash in covariate dispatch
- Preserve invalid regression cells as NaN instead of dropping (REGISTRY.md compliance)
- Filter NaN effects from aggregation in simple, event study, and group methods
- Add summary warnings to batch bootstrap stats for insufficient samples and zero SE
- Extend validator field coverage to include t_stat, p_value, conf_int, group_effects
- Re-save baselines with extended schema

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Mar 14, 2026

/ai-review


🔁 AI review rerun (requested by @igerber)

Head SHA: 4d54cd78760c803b9a3b714a02acfd72d96d642d


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Affected methods: Callaway-Sant’Anna ATT(g,t) estimation on the optimized estimation_method="reg" covariate path, plus multiplier-bootstrap inference for overall ATT, event-study, and group aggregations.
  • The prior re-review blockers appear resolved: covariates=[] is normalized to None, numerically invalid ATT(g,t) cells are now preserved instead of dropped, and the new batch bootstrap helper emits warnings for insufficient valid draws / zero SE cases.
  • Severity P1 [Newly identified]: once the optimized regression path preserves an invalid ATT(g,t) cell as NaN, the bootstrap aggregation code still builds overall/event-study/group bootstrap targets from the unfiltered cell set. That contaminates aggregate bootstrap SE/CI/p-values even though the analytical aggregators now correctly exclude those NaN cells.
  • I did not find a methodology defect in the Cholesky/logit caching itself. The not_yet_treated cache-keying concern from the previous review looks addressed.
  • Static review only: I could not execute pytest or a targeted repro in this environment because pytest is not installed and numpy is unavailable.

Methodology

Affected methods: Callaway-Sant’Anna ATT(g,t) with optimized covariate regression, and bootstrap aggregation of ATT(g,t) into overall/event-study/group effects. Cross-checking against the registry, the caching is a valid numerical optimization; the remaining issue is the bootstrap treatment of preserved NaN cells.

Code Quality

No separate findings. The earlier covariates=[] crash appears fixed at diff_diff/staggered.py:L1129.

Performance

No findings. The Cholesky/logistic-coefficient caching is a reasonable implementation choice and does not look like a methodology defect on its own.

Maintainability

No separate findings beyond the bootstrap/analytical aggregation drift above.

Tech Debt

No separate findings. TODO.md does not contain an entry that mitigates the P1 above, so it remains approval-blocking. See TODO.md:L38.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new coverage does not exercise the remaining failing interaction. The added preserved-NaN regression test stops at n_bootstrap=0, and the speed validator still omits group-level t_stat / p_value / CI fields, so this PR’s changed aggregate-bootstrap paths can still regress without detection. Concrete fix: add one bootstrap regression test that reuses the poisoned-lstsq setup with n_bootstrap>0 and aggregate="all" and asserts finite aggregates keep finite bootstrap inference after excluding invalid cells; extend the validator to serialize and compare group-effect inference fields as well. References: tests/test_staggered.py:L1049, benchmarks/speed_review/validate_results.py:L104, benchmarks/speed_review/validate_results.py:L229

Path to Approval

  1. Filter non-finite ATT(g,t) cells out of the bootstrap overall-ATT aggregation before computing overall_weights_post, original_overall, and the combined IF, mirroring diff_diff/staggered_aggregation.py:L90.
  2. Apply the same finite-cell filtering and weight renormalization inside _prepare_event_study_aggregation() and _prepare_group_aggregation() so bootstrap event-study/group inference is based on the same cell set as the reported analytical aggregates.
  3. Add a regression test that forces one optimized covariate-regression ATT(g,t) cell to NaN under n_bootstrap>0, aggregate="all" and asserts overall/event-study/group bootstrap SE/CI/p-values are computed from the remaining finite cells rather than all becoming NaN.
