Fix tokenizer mismatch between benchmark client and sglang server #937
Open
ZhaiFeiyue wants to merge 2 commits into SemiAnalysisAI:main from
Conversation
…transformers v5

Transformers v5 incorrectly rebuilds pre_tokenizer/decoder components for models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama tokenizer architecture. The sglang server fixes this at startup, but the benchmark client loads the tokenizer without these fixes, causing a ~5x token count inflation (e.g. 7000 tokens -> 35000 tokens) and false performance regressions in TTFT and throughput benchmarks. Apply the same tokenizer fixes (pre_tokenizer/decoder restoration and add_bos_token recovery) that the sglang server applies, so client and server tokenize identically. No-op on transformers v4. Made-with: Cursor
Contributor
|
thanks @ZhaiFeiyue. Is this for all models in sglang, or just for glm5? |
Collaborator
|
@ZhaiFeiyue Thank you for identifying this error and for the fix. +1 to @functionstackx's comment. It will be helpful to identify which configs from amd-master.yaml and nvidia-master.yaml are affected by this so that we can re-run and get accurate perf across the board. +viz @kedarpotdar-nv |
Author
hi @functionstackx, we only found this issue for DeepSeek-R1; we have not run any tests for glm5
kkHuang-amd pushed a commit to kkHuang-amd/useful-scripts that referenced this pull request on Mar 26, 2026
Summary
Problem
When benchmarking sglang server performance with the latest sglang (which uses transformers 5.3.0), we observed a significant performance regression compared to the older 0313-2 image (transformers 4.57.1).
Root Cause
Transformers v5 changed how LlamaTokenizerFast is loaded. During __init__, v5 rebuilds the pre_tokenizer and decoder from scratch using Llama-specific components, discarding the originals defined in tokenizer.json. For models like DeepSeek-R1 that declare LlamaTokenizerFast but actually use a ByteLevel/Sequence tokenizer architecture, v5 incorrectly replaces:
pre_tokenizer: Sequence → Metaspace
decoder: ByteLevel → Sequence
The sglang server already fixes this at startup via _fix_v5_tokenizer_components() in hf_transformers_utils.py, but the benchmark client loads the tokenizer directly via AutoTokenizer.from_pretrained() without these fixes. This mismatch causes the client and server to tokenize text differently: the ~5x token inflation on the server side results in 5x more prefill work, which directly causes the observed TTFT and throughput regression.
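One way to catch this class of mismatch early is to compare the client-side token count against the prompt token count the server reports for the same request. The helper below is a hypothetical sketch (the function name and the 5% tolerance are assumptions, not part of this PR):

```python
def check_tokenizer_agreement(client_token_count: int,
                              server_prompt_tokens: int,
                              rel_tol: float = 0.05) -> bool:
    """Return True if client and server token counts agree within rel_tol.

    A large discrepancy (like the ~5x inflation described above,
    7000 -> 35000 tokens) indicates the client tokenizer is not
    configured the same way as the server's.
    """
    if server_prompt_tokens == 0:
        return client_token_count == 0
    diff = abs(client_token_count - server_prompt_tokens)
    return diff / server_prompt_tokens <= rel_tol

# Numbers from the report above:
print(check_tokenizer_agreement(7000, 35000))  # mismatched tokenizers -> False
print(check_tokenizer_agreement(7400, 7400))   # matching tokenizers   -> True
```

Running such a check once at benchmark startup would have flagged this regression as a tokenization bug rather than a server performance problem.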
Fix
Add _fix_tokenizer_for_sglang() to the benchmark client's get_tokenizer(), applying the same two fixes that the sglang server applies:
Restore the original pre_tokenizer/decoder from tokenizer.json, overwriting the incorrect v5 replacements.
Recover the add_bos_token/add_eos_token flags from tokenizer_config.json that v5 strips during loading.
This is a no-op on transformers v4.
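The real fix operates on the fast tokenizer's backend via the `tokenizers` library; the sketch below only illustrates the two steps on stand-in objects (the exact signature of fix_tokenizer_for_sglang here is an assumption, not the PR's code):

```python
from types import SimpleNamespace

def fix_tokenizer_for_sglang(tokenizer, tokenizer_json: dict,
                             tokenizer_config: dict):
    # Step 1: restore the original pre_tokenizer/decoder defined in
    # tokenizer.json, overwriting the Llama-specific components that
    # transformers v5 rebuilt in __init__.
    if tokenizer_json.get("pre_tokenizer") is not None:
        tokenizer.pre_tokenizer = tokenizer_json["pre_tokenizer"]
    if tokenizer_json.get("decoder") is not None:
        tokenizer.decoder = tokenizer_json["decoder"]
    # Step 2: recover the add_bos_token/add_eos_token flags from
    # tokenizer_config.json, which v5 strips during loading.
    for flag in ("add_bos_token", "add_eos_token"):
        if flag in tokenizer_config:
            setattr(tokenizer, flag, tokenizer_config[flag])
    return tokenizer

# Stand-in mimicking the DeepSeek-R1 case described above: v5 replaced
# the Sequence pre_tokenizer with Metaspace and the ByteLevel decoder
# with Sequence.
broken = SimpleNamespace(pre_tokenizer={"type": "Metaspace"},
                         decoder={"type": "Sequence"})
fixed = fix_tokenizer_for_sglang(
    broken,
    tokenizer_json={"pre_tokenizer": {"type": "Sequence"},
                    "decoder": {"type": "ByteLevel"}},
    tokenizer_config={"add_bos_token": True},
)
print(fixed.pre_tokenizer["type"], fixed.decoder["type"])  # Sequence ByteLevel
```

Because both steps only write back values already present on disk, the function does nothing on transformers v4, where the loaded components already match tokenizer.json.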
Results
Token count (per request, ~7400 token input):
Benchmark (10 prompts, concurrency=1, input ~8K, output ~1K, --disable-radix-cache):
After the fix, new image performance matches the old image: there is no actual regression.