
Fix PDF loading failures from trailing null padding and PdfName cache eviction #57

Merged: Mythie merged 3 commits into main from issue/54 on Mar 21, 2026

@Mythie commented Mar 21, 2026

Fixes #54.

findStartXRef misses startxref behind trailing null padding

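Some systems pad PDFs with null bytes after %%EOF. When the padding is longer than the 1024-byte backward search window, the search lands entirely in padding and never reaches startxref. findStartXRef now skips trailing whitespace (which in PDF includes the null byte) to find the effective end of file, then searches backward from there.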
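The skip-then-search order can be sketched as follows. This is a minimal illustration, not the code from this PR: `findStartXRefOffset`, `isPdfWhitespace`, and `SEARCH_WINDOW` are hypothetical names, and the real findStartXRef may differ in signature and error handling.

```typescript
// Hypothetical sketch; names and structure are assumptions, not the library's code.
const SEARCH_WINDOW = 1024;

// PDF whitespace characters: NUL, HT, LF, FF, CR, SP.
function isPdfWhitespace(byte: number): boolean {
  return (
    byte === 0x00 || byte === 0x09 || byte === 0x0a ||
    byte === 0x0c || byte === 0x0d || byte === 0x20
  );
}

function findStartXRefOffset(bytes: Uint8Array): number | null {
  // 1. Find the effective end of file by skipping trailing whitespace/padding.
  let end = bytes.length;
  while (end > 0 && isPdfWhitespace(bytes[end - 1])) end--;

  // 2. Search backward for "startxref" within the window before that point.
  const keyword = new TextEncoder().encode("startxref");
  const windowStart = Math.max(0, end - SEARCH_WINDOW);
  for (let i = end - keyword.length; i >= windowStart; i--) {
    if (!keyword.every((b, k) => bytes[i + k] === b)) continue;

    // 3. Parse the decimal offset that follows the keyword.
    let pos = i + keyword.length;
    while (pos < end && isPdfWhitespace(bytes[pos])) pos++;
    let value = 0;
    let digits = 0;
    while (pos < end && bytes[pos] >= 0x30 && bytes[pos] <= 0x39) {
      value = value * 10 + (bytes[pos] - 0x30);
      pos++;
      digits++;
    }
    return digits > 0 ? value : null;
  }
  return null;
}
```

Anchoring the window at the effective end rather than at `bytes.length` means the amount of padding no longer matters: the window always covers the last 1024 bytes of real content.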

Brute-force recovery fails on streams with indirect /Length

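During brute-force recovery, IndirectObjectParser has no lengthResolver, so a stream with an indirect length such as /Length 42 0 R throws. If that stream is an object stream (ObjStm), the compressed objects inside it are lost. The parser now catches the failure and scans forward for the endstream keyword instead.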
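A rough sketch of the fallback, under stated assumptions: `readStream`, `readStreamByScanning`, and `indexOfBytes` are hypothetical helpers written for illustration, and `resolveLength` stands in for whatever the parser does when it tries to resolve /Length. The actual IndirectObjectParser code is not shown in this PR description.

```typescript
// Hypothetical helpers; not the library's API.
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array, from: number): number {
  outer: for (let i = from; i <= haystack.length - needle.length; i++) {
    for (let k = 0; k < needle.length; k++) {
      if (haystack[i + k] !== needle[k]) continue outer;
    }
    return i;
  }
  return -1;
}

// Locate the stream data by scanning forward for the "endstream" keyword.
function readStreamByScanning(bytes: Uint8Array, dataStart: number): Uint8Array | null {
  const marker = new TextEncoder().encode("endstream");
  const markerAt = indexOfBytes(bytes, marker, dataStart);
  if (markerAt === -1) return null;

  // Drop the end-of-line marker (CRLF, LF, or CR) that precedes "endstream".
  let dataEnd = markerAt;
  if (dataEnd > dataStart && bytes[dataEnd - 1] === 0x0a) dataEnd--;
  if (dataEnd > dataStart && bytes[dataEnd - 1] === 0x0d) dataEnd--;
  return bytes.subarray(dataStart, dataEnd);
}

// Try the declared length first; fall back to scanning if it cannot be resolved.
function readStream(
  bytes: Uint8Array,
  dataStart: number,
  resolveLength: () => number,
): Uint8Array | null {
  try {
    const length = resolveLength();
    return bytes.subarray(dataStart, dataStart + length);
  } catch {
    return readStreamByScanning(bytes, dataStart);
  }
}
```

The scan is a heuristic: binary stream data can contain the bytes "endstream" by coincidence, and data that legitimately ends in CR or LF loses that byte. That trade-off is usually acceptable in a recovery path, where the alternative is losing the stream entirely.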

PdfName LRU evicts names still held as PdfDict keys

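PdfDict stores its entries in a Map<PdfName, PdfObject>, which compares keys by reference. The 10k-entry LRU could evict a PdfName that was still in use as a key; the next lookup then interned a fresh instance, so dict.get("Root") silently returned undefined. The LRU is replaced with a cache built on WeakRef and FinalizationRegistry.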
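The failure mode can be reproduced in miniature. The classes below are stand-ins written for this illustration (`Name`, `LruNameCache`, and a capacity of 2 instead of 10k); they are not the library's PdfName or its cache.

```typescript
// Stand-in for an interned name type; illustrative only.
class Name {
  readonly value: string;
  constructor(value: string) {
    this.value = value;
  }
}

// A tiny LRU interning cache, assumed to behave like the old one in spirit.
class LruNameCache {
  private readonly entries = new Map<string, Name>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  of(value: string): Name {
    const hit = this.entries.get(value);
    if (hit) {
      // Refresh recency by re-inserting at the end of the Map.
      this.entries.delete(value);
      this.entries.set(value, hit);
      return hit;
    }
    const created = new Name(value);
    this.entries.set(value, created);
    if (this.entries.size > this.maxSize) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
    return created;
  }
}

const lru = new LruNameCache(2);
const dict = new Map<Name, string>();
dict.set(lru.of("Root"), "catalog");

const lookupBeforeEviction = dict.get(lru.of("Root")); // same instance: "catalog"

lru.of("A");
lru.of("B"); // pushes "Root" out of the cache

// lru.of("Root") now creates a new instance, which is not the Map's key.
const lookupAfterEviction = dict.get(lru.of("Root")); // undefined
```

The dictionary still holds the original key object, but nothing can produce that object again, so the entry becomes unreachable by name.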

Names stay cached as long as any live object holds a strong reference to them. A load test confirms that the old code breaks under cache pressure.
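A minimal sketch of a weak interning cache along these lines, assuming a runtime with WeakRef and FinalizationRegistry (ES2021). `InternedName` and `WeakNameCache` are hypothetical names; the PR's actual PdfName implementation, including its permanent cache, is not reproduced here.

```typescript
// Stand-in for an interned name type; illustrative only.
class InternedName {
  readonly value: string;
  constructor(value: string) {
    this.value = value;
  }
}

class WeakNameCache {
  private readonly refs = new Map<string, WeakRef<InternedName>>();

  // When a name is garbage collected, drop its map entry, unless the slot
  // has already been refilled with a new live instance.
  private readonly registry = new FinalizationRegistry<string>((value) => {
    const ref = this.refs.get(value);
    if (ref && ref.deref() === undefined) this.refs.delete(value);
  });

  of(value: string): InternedName {
    const existing = this.refs.get(value)?.deref();
    if (existing) return existing;

    const created = new InternedName(value);
    this.refs.set(value, new WeakRef(created));
    this.registry.register(created, value);
    return created;
  }
}

const names = new WeakNameCache();
const root = names.of("Root");

// Interning many other names cannot displace "Root" while `root` is held.
for (let i = 0; i < 50_000; i++) names.of("N" + i);

const sameInstance = names.of("Root") === root;
```

Because a dictionary that uses a name as a key holds a strong reference to it, the name cannot be collected while the dictionary is alive, so lookups by reference keep working regardless of how many other names are created. Collection timing is up to the garbage collector, so the cleanup callback should be treated as best-effort.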

Mythie added 3 commits March 21, 2026 12:45
PDFs padded with null bytes beyond %%EOF (common when uploaded through
systems that pad to block boundaries) caused startxref lookup to fail
because the 1024-byte search window fell entirely within padding.
Skip trailing whitespace to find the effective end of file before
searching. Fixes #54.

During brute-force recovery, IndirectObjectParser has no lengthResolver,
so streams with indirect /Length references (e.g. /Length 42 0 R) would
fail to parse. This prevented object streams from being read, making
their compressed objects invisible to recovery. Now scans forward for
the endstream keyword as a fallback, matching the approach used by
pdf.js and PDFBox. Partial fix for #54.

The LRU cache (max 10k) could evict PdfName instances still held as
keys in PdfDict's Map<PdfName, PdfObject>, causing silent lookup
failures via reference inequality. This manifests in long-running
servers processing many PDFs with diverse name sets.

Replace with a WeakRef-based cache (matching PDFBox's COSName
approach): names stay interned as long as any live object holds a
strong reference, and a FinalizationRegistry cleans up dead entries.
Also expands the permanent cache with trailer keys (Root, Size, Info,
Prev, ID, Encrypt) and high-frequency names (Subtype, Font, BaseFont,
Encoding, XObject, Annots, Names). Closes #54.

@github-actions (Contributor)

Benchmark Results

Comparison

Load PDF

Benchmark Mean p99 RME Samples
libpdf 2.28ms 3.26ms ±1.5% 220
pdf-lib 39.39ms 44.97ms ±4.8% 13
@cantoo/pdf-lib 38.61ms 42.87ms ±2.5% 13

Create blank PDF

Benchmark Mean p99 RME Samples
libpdf 58μs 131μs ±1.6% 8568
pdf-lib 411μs 1.44ms ±2.4% 1217
@cantoo/pdf-lib 443μs 1.67ms ±2.7% 1130

Add 10 pages

Benchmark Mean p99 RME Samples
libpdf 103μs 277μs ±1.3% 4854
pdf-lib 540μs 2.00ms ±3.0% 927
@cantoo/pdf-lib 499μs 2.51ms ±3.7% 1005

Draw 50 rectangles

Benchmark Mean p99 RME Samples
libpdf 323μs 916μs ±1.8% 1549
pdf-lib 1.77ms 6.72ms ±6.8% 285
@cantoo/pdf-lib 2.05ms 5.83ms ±5.7% 244

Load and save PDF

Benchmark Mean p99 RME Samples
libpdf 2.38ms 4.71ms ±2.3% 210
pdf-lib 90.37ms 125.39ms ±10.4% 10
@cantoo/pdf-lib 156.46ms 161.20ms ±1.1% 10

Load, modify, and save PDF

Benchmark Mean p99 RME Samples
libpdf 42.73ms 47.73ms ±3.9% 12
pdf-lib 86.39ms 95.34ms ±3.7% 10
@cantoo/pdf-lib 155.32ms 158.56ms ±1.3% 10

Extract single page from 100-page PDF

Benchmark Mean p99 RME Samples
libpdf 3.68ms 5.84ms ±1.8% 136
pdf-lib 9.18ms 13.49ms ±2.5% 55
@cantoo/pdf-lib 9.75ms 12.24ms ±2.6% 52

Split 100-page PDF into single-page PDFs

Benchmark Mean p99 RME Samples
libpdf 33.21ms 36.20ms ±2.6% 16
pdf-lib 87.83ms 90.92ms ±2.4% 6
@cantoo/pdf-lib 95.28ms 109.76ms ±8.9% 6

Split 2000-page PDF into single-page PDFs (0.9MB)

Benchmark Mean p99 RME Samples
libpdf 612.13ms 612.13ms ±0.0% 1
pdf-lib 1.67s 1.67s ±0.0% 1
@cantoo/pdf-lib 1.73s 1.73s ±0.0% 1

Copy 10 pages between documents

Benchmark Mean p99 RME Samples
libpdf 4.63ms 5.66ms ±1.4% 109
pdf-lib 12.15ms 14.56ms ±2.0% 42
@cantoo/pdf-lib 14.00ms 19.11ms ±3.5% 36

Merge 2 x 100-page PDFs

Benchmark Mean p99 RME Samples
libpdf 14.86ms 21.36ms ±3.3% 34
pdf-lib 55.24ms 58.07ms ±2.2% 10
@cantoo/pdf-lib 64.76ms 67.93ms ±1.9% 8

Fill FINTRAC form fields

Benchmark Mean p99 RME Samples
libpdf 21.13ms 24.63ms ±3.6% 24
pdf-lib 34.81ms 42.54ms ±5.3% 15
@cantoo/pdf-lib 36.24ms 44.21ms ±5.8% 14

Fill and flatten FINTRAC form

Benchmark Mean p99 RME Samples
libpdf 20.08ms 35.25ms ±7.9% 25
pdf-lib FAILED - - 0
@cantoo/pdf-lib 39.70ms 44.90ms ±4.7% 13

Copying

Copy pages between documents

Benchmark Mean p99 RME Samples
copy 1 page 1.04ms 2.01ms ±2.5% 483
copy 10 pages from 100-page PDF 4.46ms 5.20ms ±1.0% 112
copy all 100 pages 7.34ms 9.47ms ±1.3% 69

Duplicate pages within same document

Benchmark Mean p99 RME Samples
duplicate page 0 882μs 1.40ms ±1.1% 568
duplicate all pages (double the document) 873μs 1.61ms ±1.1% 573

Merge PDFs

Benchmark Mean p99 RME Samples
merge 2 small PDFs 1.45ms 2.29ms ±1.3% 345
merge 10 small PDFs 7.76ms 9.32ms ±1.2% 65
merge 2 x 100-page PDFs 13.42ms 13.92ms ±0.8% 38

Drawing

benchmarks/drawing.bench.ts

Benchmark Mean p99 RME Samples
draw 100 rectangles 544μs 1.30ms ±1.8% 920
draw 100 circles 1.28ms 3.15ms ±3.1% 392
draw 100 lines 518μs 1.23ms ±2.1% 966
draw 100 text lines (standard font) 1.56ms 2.30ms ±1.3% 321
create 10 pages with mixed content 1.37ms 2.70ms ±2.2% 366

Forms

benchmarks/forms.bench.ts

Benchmark Mean p99 RME Samples
get form fields 3.61ms 8.65ms ±4.7% 139
fill text fields 11.92ms 18.64ms ±4.6% 42
read field values 2.88ms 3.69ms ±1.2% 174
flatten form 8.52ms 12.82ms ±3.1% 59

Loading

benchmarks/loading.bench.ts

Benchmark Mean p99 RME Samples
load small PDF (888B) 57μs 130μs ±0.7% 8819
load medium PDF (19KB) 88μs 119μs ±0.5% 5681
load form PDF (116KB) 1.29ms 1.85ms ±0.9% 388
load heavy PDF (9.9MB) 2.17ms 2.61ms ±0.7% 231

Saving

benchmarks/saving.bench.ts

Benchmark Mean p99 RME Samples
save unmodified (19KB) 108μs 248μs ±4.9% 4613
save with modifications (19KB) 753μs 1.42ms ±1.3% 665
incremental save (19KB) 160μs 324μs ±1.0% 3132
save heavy PDF (9.9MB) 2.28ms 2.80ms ±1.1% 220
incremental save heavy PDF (9.9MB) 8.38ms 10.01ms ±3.2% 60

Splitting

Extract single page

Benchmark Mean p99 RME Samples
extractPages (1 page from small PDF) 1.04ms 2.18ms ±2.4% 481
extractPages (1 page from 100-page PDF) 3.92ms 6.93ms ±2.9% 128
extractPages (1 page from 2000-page PDF) 62.27ms 65.16ms ±1.7% 10

Split into single-page PDFs

Benchmark Mean p99 RME Samples
split 100-page PDF (0.1MB) 32.09ms 37.83ms ±4.0% 16
split 2000-page PDF (0.9MB) 572.79ms 572.79ms ±0.0% 1

Batch page extraction

Benchmark Mean p99 RME Samples
extract first 10 pages from 2000-page PDF 61.93ms 63.39ms ±1.2% 9
extract first 100 pages from 2000-page PDF 65.27ms 66.70ms ±1.5% 8
extract every 10th page from 2000-page PDF (200 pages) 71.17ms 75.55ms ±2.3% 8

Environment
  • Runner: Linux (X64)
  • Runtime: Bun 1.3.11

Results are machine-dependent.

@Mythie merged commit f8cde4a into main on Mar 21, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

findStartXRef fails on PDFs with large trailing null padding, causing brute-force fallback to miss compressed objects
