Skip to content

Replace regex with DOMParser in GPT document.write rewriting#503

Open
ChristianPavilonis wants to merge 3 commits intomainfrom
fix/gpt-regex-rewrite
Open

Replace regex with DOMParser in GPT document.write rewriting#503
ChristianPavilonis wants to merge 3 commits intomainfrom
fix/gpt-regex-rewrite

Conversation

@ChristianPavilonis
Copy link
Collaborator

Summary

  • Replace fragile SCRIPT_SRC_RE regex with DOMParser-based HTML parsing in the GPT document.write / document.writeln interception layer, fixing edge cases where unquoted attributes, unusual spacing, or mixed quote styles caused the regex to fail-open and bypass the first-party proxy.
  • Fail closed: when DOMParser is unavailable or throws, return an empty string instead of passing through unproxied GPT URLs.
  • Add 8 new test cases covering single-quoted attrs, extra whitespace, multiple script tags, writeln parity, non-GPT passthrough, protocol-relative URLs, and both fail-closed scenarios.

Changes

File Change
crates/js/lib/src/integrations/gpt/script_guard.ts Replaced SCRIPT_SRC_RE regex and regex-based rewriteHtmlString with DOMParser.parseFromString implementation; uses getAttribute("src") + replaceAll() for robust URL swapping; adds fast-path skip when GPT domain not present; fails closed on error
crates/js/lib/test/integrations/gpt/script_guard.test.ts Added 8 edge-case tests: single-quoted src, extra whitespace, multiple scripts, writeln parity, non-GPT passthrough, protocol-relative URLs, DOMParser unavailable, DOMParser throws

Closes

Closes #407

Test plan

  • cargo test --workspace
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo fmt --all -- --check
  • JS tests: cd crates/js/lib && npx vitest run
  • JS format: cd crates/js/lib && npm run format

Checklist

  • Changes follow CLAUDE.md conventions
  • No unwrap() in production code — use expect("should ...")
  • Uses tracing macros (not println!)
  • New code has tests
  • No secrets or credentials committed

@ChristianPavilonis ChristianPavilonis self-assigned this Mar 13, 2026
@ChristianPavilonis
Copy link
Collaborator Author

dismissed github bot findings that were introduced from previous commits and already determined to be false positives.

Copy link
Collaborator

@aram356 aram356 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Replaces a fragile regex with DOMParser-based HTML parsing for GPT document.write interception — a good direction with solid test coverage. One blocking issue around a missing fast-path guard.

Blocking

🔧 wrench

  • Missing fast-path guard: rewriteHtmlString enters the DOMParser / fail-closed path for all HTML, not just GPT-related strings. Non-GPT document.write calls will be silently dropped when DOMParser is unavailable, and unnecessarily parsed when it is. (script_guard.ts:208)

Non-blocking

🤔 thinking

  • replaceAll is a blunt instrument: Replaces the URL everywhere in the string, not just inside src attributes. Unlikely to cause issues with GPT's simple output, but less precise than the old regex. (script_guard.ts:229)

CI Status

  • cargo fmt: PASS
  • cargo clippy: PASS
  • cargo test: PASS
  • vitest: PASS
  • format-docs: PASS
  • format-typescript: PASS
  • CodeQL: PASS

Copy link
Collaborator

@prk-Jr prk-Jr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This change replaces the GPT document.write regex rewrite path with a DOMParser-based implementation and adds broader coverage for edge-case markup. The parser-based approach is directionally better, but the new replacement strategy introduces a behavior regression for non-GPT writes under parser failure and a proxy-bypass case for HTML-escaped URLs.

Blocking

🔧 wrench

  • Non-GPT document.write output gets dropped when DOMParser is unavailable or broken: rewriteHtmlString() returns an empty string before it knows whether the HTML actually contains a GPT URL, so benign writes are erased in the very failure modes this PR adds support for.
  • HTML-escaped query strings can bypass the rewrite and leak back to the Google host: the new flow reads src through getAttribute() and then runs replaceAll() on the original HTML string, which misses common serialized forms such as & and can emit the original securepubads.g.doubleclick.net URL unchanged.

Non-blocking

🤔 thinking

  • The new test section still misses the DOMParser-specific regression cases above: adding coverage for escaped entities like & and for non-GPT pass-through when DOMParser is unavailable would catch both regressions.

CI Status

  • fmt: PASS
  • rust tests: PASS
  • vitest: PASS
  • CodeQL: PASS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regex-based HTML rewriting in document.write interception can fail-open

3 participants