Fix tokenizer mismatch between benchmark client and sglang server #937
Open
ZhaiFeiyue wants to merge 2 commits into SemiAnalysisAI:main from
Conversation
…transformers v5

Transformers v5 incorrectly rebuilds pre_tokenizer/decoder components for models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama tokenizer architecture. The sglang server fixes this at startup, but the benchmark client loads the tokenizer without these fixes, causing a ~5x token count inflation (e.g. 7000 tokens -> 35000 tokens) and false performance regressions in TTFT and throughput benchmarks. Apply the same tokenizer fixes (pre_tokenizer/decoder restoration and add_bos_token recovery) that the sglang server applies, so client and server tokenize identically. No-op on transformers v4. Made-with: Cursor
Contributor
|
thanks @ZhaiFeiyue. Is this for all models in sglang, or just for glm5? |
Collaborator
|
@ZhaiFeiyue Thank you for identifying this error and for the fix. +1 to @functionstackx's comment. It will be helpful to identify which configs from amd-master.yaml and nvidia-master.yaml are affected by this so that we can re-run and get accurate perf across the board. +viz @kedarpotdar-nv |
Author
hi @functionstackx, we only found this issue for DeepSeek-R1; we have not run any tests for glm5
kkHuang-amd pushed a commit to kkHuang-amd/useful-scripts that referenced this pull request on Mar 26, 2026
Summary
Problem
When benchmarking sglang server performance with the latest sglang (which uses transformers 5.3.0), we observed a significant performance regression compared to the older 0313-2 image (transformers 4.57.1).
Root Cause
Transformers v5 changed how LlamaTokenizerFast is loaded. During __init__, v5 rebuilds the pre_tokenizer and decoder from scratch using Llama-specific components, discarding the originals defined in tokenizer.json. For models like DeepSeek-R1 that declare LlamaTokenizerFast but actually use a ByteLevel/Sequence tokenizer architecture, v5 incorrectly replaces:
pre_tokenizer: Sequence → Metaspace
decoder: ByteLevel → Sequence
The sglang server already fixes this at startup via _fix_v5_tokenizer_components() in hf_transformers_utils.py, but the benchmark client loads the tokenizer directly via AutoTokenizer.from_pretrained() without these fixes. This mismatch causes the client and server to tokenize text differently: the ~5x token inflation on the server side results in 5x more prefill work, which directly causes the observed TTFT and throughput regression.
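One way to catch this class of mismatch early is to compare the client-side token count against the prompt token count the server reports for the same request. The helper below is a hypothetical sketch (the function name and the 5% tolerance are assumptions, not part of this PR):

```python
def check_tokenizer_agreement(client_token_count: int,
                              server_prompt_tokens: int,
                              rel_tol: float = 0.05) -> bool:
    """Return True if client and server token counts agree within rel_tol.

    A large discrepancy (like the ~5x inflation described above,
    7000 -> 35000 tokens) indicates the client tokenizer is not
    configured the same way as the server's.
    """
    if server_prompt_tokens == 0:
        return client_token_count == 0
    diff = abs(client_token_count - server_prompt_tokens)
    return diff / server_prompt_tokens <= rel_tol

# Numbers from the report above:
print(check_tokenizer_agreement(7000, 35000))  # mismatched tokenizers -> False
print(check_tokenizer_agreement(7400, 7400))   # matching tokenizers   -> True
```

Running such a check once at benchmark startup would have flagged this regression as a tokenization bug rather than a server performance problem.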
Fix
Add _fix_tokenizer_for_sglang() to the benchmark client's get_tokenizer(), applying the same two fixes that the sglang server applies:
Restore the original pre_tokenizer/decoder from tokenizer.json, overwriting the incorrect v5 replacements.
Recover the add_bos_token/add_eos_token flags from tokenizer_config.json that v5 strips during loading.
This is a no-op on transformers v4.
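The real fix operates on the fast tokenizer's backend via the `tokenizers` library; the sketch below only illustrates the two steps on stand-in objects (the exact signature of fix_tokenizer_for_sglang here is an assumption, not the PR's code):

```python
from types import SimpleNamespace

def fix_tokenizer_for_sglang(tokenizer, tokenizer_json: dict,
                             tokenizer_config: dict):
    # Step 1: restore the original pre_tokenizer/decoder defined in
    # tokenizer.json, overwriting the Llama-specific components that
    # transformers v5 rebuilt in __init__.
    if tokenizer_json.get("pre_tokenizer") is not None:
        tokenizer.pre_tokenizer = tokenizer_json["pre_tokenizer"]
    if tokenizer_json.get("decoder") is not None:
        tokenizer.decoder = tokenizer_json["decoder"]
    # Step 2: recover the add_bos_token/add_eos_token flags from
    # tokenizer_config.json, which v5 strips during loading.
    for flag in ("add_bos_token", "add_eos_token"):
        if flag in tokenizer_config:
            setattr(tokenizer, flag, tokenizer_config[flag])
    return tokenizer

# Stand-in mimicking the DeepSeek-R1 case described above: v5 replaced
# the Sequence pre_tokenizer with Metaspace and the ByteLevel decoder
# with Sequence.
broken = SimpleNamespace(pre_tokenizer={"type": "Metaspace"},
                         decoder={"type": "Sequence"})
fixed = fix_tokenizer_for_sglang(
    broken,
    tokenizer_json={"pre_tokenizer": {"type": "Sequence"},
                    "decoder": {"type": "ByteLevel"}},
    tokenizer_config={"add_bos_token": True},
)
print(fixed.pre_tokenizer["type"], fixed.decoder["type"])  # Sequence ByteLevel
```

Because both steps only write back values already present on disk, the function does nothing on transformers v4, where the loaded components already match tokenizer.json.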
Results
Token count (per request, ~7400 token input):
Benchmark (10 prompts, concurrency=1, input ~8K, output ~1K, --disable-radix-cache):
After the fix, new image performance matches the old image: there is no actual regression.