
Separate eval-only workflow and change to 8k1k #911

Open

Oseltamivir wants to merge 53 commits into main from eval-8k1k-server-restart

Conversation

Collaborator

@Oseltamivir Oseltamivir commented Mar 15, 2026

Summary

  • Decouple eval jobs from benchmark jobs into a dedicated eval workflow stage, so evals can run independently with evals-only: true in perf-changelog
  • Add eval-only input to benchmark-tmpl.yml that skips throughput benchmarks and result file checks, running only lm_eval
  • Fix GSM8k prompt template: change #### <answer> to #### [number] to prevent models (e.g. Kimi-K2.5) from outputting the literal <answer> tag, which caused strict-match to return [invalid]
  • Add eval support to all 49 single-node benchmark scripts (server restart with native max context, eval context override, request timeout increase)
  • Update process_changelog.py and generate_sweep_configs.py to emit a separate evals matrix for eval-only jobs
  • Change the eval workload from 1k8k to 8k1k
  • Groups now include dp-attn (previously missing, so dp-attn variants were merged)
  • Instead of picking the max concurrency at the highest/lowest TP, the eval sweep now picks all TPs at the highest concurrency plus all TPs at the median concurrency
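The separate evals matrix emitted by process_changelog.py / generate_sweep_configs.py could look roughly like the sketch below. The function name, entry fields, and matrix shape are illustrative assumptions, not the real schema:

```python
# Hypothetical sketch of splitting changelog entries into a benchmark
# matrix and an eval-only matrix. Field names ("evals-only", "config-key")
# mirror the PR description, but the actual schema may differ.
def split_matrices(entries):
    """Partition entries on the evals-only flag."""
    benchmarks = [e for e in entries if not e.get("evals-only", False)]
    evals = [e for e in entries if e.get("evals-only", False)]
    return {"include": benchmarks}, {"include": evals}

bench_matrix, eval_matrix = split_matrices([
    {"config-key": "gptoss-fp4-h100-vllm", "evals-only": True},
    {"config-key": "dsr1-fp8-mi325x-sglang"},
])
print(len(bench_matrix["include"]), len(eval_matrix["include"]))  # 1 1
```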

Separate jobs on a normal run tested with:

- config-keys:
    - gptoss-fp4-h100-vllm
    - dsr1-fp8-mi325x-sglang

Implementation Details

  • get_native_max_context_length(model) — reads max_position_embeddings / max_sequence_length / seq_length / n_positions from the model's HuggingFace config.
  • max_output_tokens: dynamically computed as eval_context_len - 4096 (capped at 16384); previously fixed at 8192
  • Score validation: new validate_scores.py — a post-eval gate that checks exact_match metrics ≥ 0.85 and exits non-zero on failure or when no metrics are found
  • Failure mode: eval-only uses if-no-files-found: error (was ignore)
  • GSM8k prompt: #### <answer> → #### [number]
  • ATOM torchvision workaround: uninstalls torchvision before lm-eval install to avoid circular import
  • PYTHONDONTWRITEBYTECODE=1 added to the workflow env to avoid EACCES errors on __pycache__
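The context-length helpers above might look roughly like this; a minimal sketch assuming the HuggingFace config is already loaded as a dict, with the key order and the 4096/16384 constants taken from the bullet points (function shapes are assumptions):

```python
# Sketch of the helpers described in Implementation Details.
def get_native_max_context_length(config: dict) -> int:
    """Return the model's native max context from its HuggingFace config,
    trying the field names listed in the PR description in order."""
    for key in ("max_position_embeddings", "max_sequence_length",
                "seq_length", "n_positions"):
        if key in config:
            return int(config[key])
    raise KeyError("no context-length field found in model config")

def max_output_tokens(eval_context_len: int) -> int:
    """eval_context_len - 4096, capped at 16384 (previously fixed at 8192)."""
    return min(eval_context_len - 4096, 16384)

print(get_native_max_context_length({"max_position_embeddings": 131072}))  # 131072
print(max_output_tokens(9416))    # 5320
print(max_output_tokens(131072))  # 16384
```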
[Screenshot 2026-03-21: new job step "evals"]

Oseltamivir and others added 22 commits March 8, 2026 08:31
- Switch eval selection from 1k8k to 8k1k in mark_eval_entries()
- After throughput benchmark, kill server and restart with model's
  native max context length (max_position_embeddings) for eval
- Replace hardcoded gen_max_tokens=16384 and max_tokens=8192 with
  the native max model length
- Add _start_eval_server() supporting sglang, vllm, trtllm, and atom
- Add EVAL_SERVER_EXTRA_ARGS to benchmark scripts for correctness flags

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
max_tokens in gen_kwargs must be less than max_length to leave room
for the input prompt. Without this cap, the server rejects requests
where input_tokens + max_tokens > max_model_len.

Uses max_length - 4096 as the generation cap (min 8192).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use 70% of max_length for max_tokens generation cap, leaving 30%
for the prompt. Echo the budget for visibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --no-skip-eval-only flag to wait_for_server_ready so
  _start_eval_server properly waits for the eval server
- Fix pip3 install in dsr1_fp8_h200.sh with --break-system-packages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump eval concurrency to 64 via EVAL_CONCURRENT_REQUESTS, independent of benchmark CONC
- Cap eval max_gen_tokens to 8192 to avoid KV cache issues
- Read num_fewshot from eval YAML instead of CLI override
- Add 'trt' as alias for trtllm in eval server case
- Reduce seq len configs for OOM prevention
- Change eval sweep to top of curve and middle configs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t error

The ATOM container has a broken torchvision that causes circular import
errors when lm_eval loads. Since we use local-chat-completions (API-based),
torchvision is not needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…val call

Pass utils/evals/ directory to --tasks so lm-eval globs all *.yaml files
and runs gsm8k + gpqa_diamond consecutively in one invocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…jection

max_gen_tokens was hardcoded to 16384 but servers with smaller
EVAL_MAX_MODEL_LEN (e.g. 9416) reject requests exceeding their limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
15 SGLang scripts now pass --context-length with compute_eval_context_length
when EVAL_ONLY=true. 3 vLLM scripts override MAX_MODEL_LEN similarly.
Reverts the max_gen_tokens cap since the server should have sufficient context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… file

Was defaulting to utils/evals/gsm8k.yaml which caused lm-eval to only
run gsm8k. Directory path lets lm-eval glob all *.yaml files (gsm8k + gpqa).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t to 1800s

5x avoids TRT OOM on H200 (47K vs 94K context). 1800s timeout prevents
single-request timeouts on slow models like Kimi K2.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	benchmarks/single_node/dsr1_fp4_b200.sh
#	benchmarks/single_node/dsr1_fp4_mi355x.sh
#	benchmarks/single_node/dsr1_fp8_b200.sh
#	benchmarks/single_node/dsr1_fp8_b200_mtp.sh
#	benchmarks/single_node/dsr1_fp8_b200_trt_mtp.sh
#	benchmarks/single_node/dsr1_fp8_h200_trt_mtp.sh
#	benchmarks/single_node/dsr1_fp8_mi300x.sh
#	benchmarks/single_node/dsr1_fp8_mi355x.sh
#	benchmarks/single_node/glm5_fp8_mi355x.sh
#	benchmarks/single_node/gptoss_fp4_mi300x.sh
#	benchmarks/single_node/gptoss_fp4_mi325x.sh
#	benchmarks/single_node/gptoss_fp4_mi355x.sh
#	benchmarks/single_node/kimik2.5_fp4_b200.sh
#	benchmarks/single_node/kimik2.5_int4_b200.sh
#	benchmarks/single_node/kimik2.5_int4_mi325x.sh
#	benchmarks/single_node/kimik2.5_int4_mi355x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_b200.sh
#	benchmarks/single_node/minimaxm2.5_fp8_h200.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi300x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi300x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi325x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi355x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi300x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi325x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi355x.sh
@github-actions

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you


@functionstackx functionstackx left a comment


thanks for the PR @Oseltamivir ! seems great that we now run evals in parallel instead of sequentially

mostly LGTM

just had a couple of questions

  1. does this PR change the evals to use the industry-standard number of shots for gsm8k?
  2. previously, when evals hard-crashed, the script still exited with a success exit code; is this fixed now?

@Oseltamivir Oseltamivir requested a review from a team March 15, 2026 22:08
@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from 600582e to 826035c Compare March 15, 2026 22:08
@Oseltamivir
Collaborator Author

  1. Changed to 5-shot
  2. Yes; also added the suggested gate that fails when a result is below 85%
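The below-85% gate could be a small post-eval script along these lines; a sketch assuming lm-eval-style results keyed by task, with exact_match metrics (the real validate_scores.py may be structured differently):

```python
# Sketch of the post-eval score gate: exit non-zero if any exact_match
# metric is below the threshold, or if no metrics are found at all.
import sys

THRESHOLD = 0.85

def validate(results: dict) -> int:
    """Return 0 if all exact_match scores pass, 1 otherwise."""
    scores = [v for task in results.values()
              for k, v in task.items() if k.startswith("exact_match")]
    if not scores:
        print("no exact_match metrics found", file=sys.stderr)
        return 1
    failures = [s for s in scores if s < THRESHOLD]
    if failures:
        print(f"{len(failures)} metric(s) below {THRESHOLD}", file=sys.stderr)
        return 1
    return 0

print(validate({"gsm8k": {"exact_match,strict-match": 0.93}}))  # 0
print(validate({"gsm8k": {"exact_match,strict-match": 0.41}}))  # 1
print(validate({}))                                             # 1
```

Returning the exit code (rather than calling sys.exit inside the helper) keeps the gate unit-testable; a real script would end with `sys.exit(validate(json.load(...)))`.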

@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from e6a8801 to 4dcfc92 Compare March 22, 2026 05:44
@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from f732c4f to f74ad43 Compare March 22, 2026 18:46
@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from 78caf1c to 04c9e88 Compare March 23, 2026 06:07
@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from b9b05a1 to 2f3a788 Compare March 23, 2026 06:12

@claude claude bot left a comment


The inline comment about compute_eval_context_length() using benchmark_ctx * 1 (instead of * 5) was marked resolved, but the fix was never applied — the diff still reads local eval_ctx=$(( benchmark_ctx * 1 )), making eval context expansion a no-op across all 49 benchmark scripts.

Extended reasoning...

Overview

This PR separates eval-only benchmarking into a dedicated workflow, updates 49 single-node benchmark scripts with EVAL_ONLY mode support, and introduces compute_eval_context_length() / setup_eval_context() helpers in benchmark_lib.sh. The changes touch CI workflow YAML, runner launch scripts, utility Python, and all single-node benchmark scripts.

Critical Unfixed Bug

The compute_eval_context_length() function in benchmarks/benchmark_lib.sh computes the eval context as:

local eval_ctx=$(( benchmark_ctx * 1 ))

The multiplier should be 5 per the function's own doc comment ("Uses 5x the benchmark context"). With * 1, every call to setup_eval_context() expands the context by a factor of one -- i.e., no expansion at all. For a typical 8k1k workload (ISL=8192, OSL=1024), the eval context is 9416 tokens instead of the intended ~47080. The inline comment flagging this (ID 2971153643, filed 2026-03-22) was marked as resolved in the GitHub thread, but the actual code change was never made. This is the single most critical bug in the PR, since all 49 updated scripts rely on this function.
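The reviewer's arithmetic can be checked in a few lines (numbers taken from the review comment, not from the code):

```python
# 8k1k benchmark context per the review: ~9416 tokens (8192 ISL + 1024 OSL
# plus headroom). The docstring promises 5x expansion; the code does 1x.
benchmark_ctx = 9416
intended_eval_ctx = benchmark_ctx * 5   # what "Uses 5x" promises
buggy_eval_ctx = benchmark_ctx * 1      # no-op expansion
print(intended_eval_ctx, buggy_eval_ctx)  # 47080 9416
```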

Security Risks

No security-sensitive paths (auth, crypto, permissions) are touched. The changes are benchmarking infrastructure only.

Level of Scrutiny

High. This is a large PR (71 files) making behavioral changes to production CI pipelines and all single-node benchmark scripts. The * 1 bug needs to be corrected before the PR can be merged.

Other Factors

All other previously flagged bugs appear to have been addressed in subsequent commits. The only blocking issue is the unfixed * 1 multiplier.


@claude claude bot left a comment


The recent commits addressed the CUDA graph ordering and rm -rf cleanup issues — thanks for working through those. The one remaining blocker is benchmark_ctx * 1 on line 624 of benchmarks/benchmark_lib.sh, which needs to be * 5 per the function’s own docstring (the open inline comment from 3/22 tracks this).

Extended reasoning...

Overview

PR #911 separates eval-only jobs from throughput benchmarks, adds a compute_eval_context_length() helper to expand the context window for lm-eval runs, and changes the eval workload selection from 1k8k to 8k1k. 71 files are touched, the majority being single-node benchmark scripts that gained EVAL_ONLY blocks.

Security Risks

No security-sensitive code is touched. The PR handles CI workflow configuration, shell benchmark scripts, and Python sweep-config generation. There are no auth, crypto, or permissions changes.

Level of Scrutiny

Moderate. The logic in compute_eval_context_length() is new and central to the feature: every one of the 49 benchmark scripts calls setup_eval_context(), which delegates to this function. A one-character error in the multiplier (* 1 instead of * 5) silently produces no context expansion at all, making the entire eval-only path a no-op for its stated purpose.

Current Status

The "final" commit (2f3a788, 2026-03-23T06:12Z) addressed two previously flagged issues: the CUDA graph ordering bug in dsr1_fp8_b200_trt_mtp.sh and the rm -f/rm -rf cleanup discrepancy in benchmark-tmpl.yml. However, the compute_eval_context_length() multiplier at benchmarks/benchmark_lib.sh:624 still reads benchmark_ctx * 1. This inline comment thread has been open since 2026-03-22T06:44Z and remains unresolved. The fix is a one-character change but it is the core correctness property of the feature.


@functionstackx functionstackx left a comment


.


@claude claude bot left a comment


Two critical bugs flagged in prior review rounds remain unpatched in the actual code despite their GitHub conversation threads being marked resolved — please verify the fixes landed in the branch.

Extended reasoning...

Overview

This PR (72 files) separates eval-only CI runs from throughput benchmarks and switches eval workloads to an 8k/1k (ISL/OSL) profile. The changes touch CI workflow YAML, benchmark_lib.sh, generate_sweep_configs.py, process_changelog.py, validate_scores.py, and every single-node benchmark shell script.

Bugs Still Present in Current Branch

Two issues previously marked resolved on GitHub are not fixed in the files on the branch:

  1. benchmarks/benchmark_lib.sh line 624: local eval_ctx=$(( benchmark_ctx * 1 )) still multiplies by 1 instead of 5. This makes compute_eval_context_length() a no-op: the server launched in eval-only mode gets exactly the same (tiny) context window as during throughput benchmarks, defeating the entire purpose of the EVAL_ONLY flow.

  2. benchmarks/single_node/dsr1_fp8_mi355x_atom_mtp.sh lines 87–88 — A stray set -x followed by rm -rf ./utils/bench_serving\ (with a bash line-continuation backslash) still unconditionally deletes the benchmark serving utilities after every run on MI355X hardware.

Security / Stability Risk

The rm -rf on line 88 is a destructive filesystem operation that runs outside any conditional block. It will silently destroy tooling on every benchmark execution, requiring manual recovery. The * 1 multiplier bug silently produces incorrect eval results (accuracy measured at the wrong context length) without any error.

Level of Scrutiny

Both of these were flagged as critical in earlier review rounds. The inline comments were marked resolved, which typically signals the author fixed them — but cross-checking the actual file contents shows the fixes were not committed. This needs explicit human confirmation that the correct patch is in place before merging.

Other Factors

A fix verified in code (dsr1_fp8_b200_trt_mtp.sh EVAL_ONLY ordering) shows the process works when changes are actually committed. The remaining two issues just need the already-identified one-line fixes to be applied and pushed.

# Conflicts:
#	benchmarks/single_node/kimik2.5_fp4_mi355x.sh
#	perf-changelog.yaml
#	runners/launch_mi300x-amd.sh
#	runners/launch_mi300x-cr.sh
