fix: add --exclusive to MI355X multi-node sbatch for accurate benchmarks#932
fix: add --exclusive to MI355X multi-node sbatch for accurate benchmarks#932
Conversation
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Without --exclusive, SLURM can co-schedule other jobs on the same nodes, causing memory bandwidth, XGMI/Infinity Fabric, and NUMA contention that degrades benchmark results — especially for non-TP8 disaggregated serving configs where cross-node RDMA traffic is latency-sensitive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Closing in favor of #929 which already covers MI355X. |
| - config-keys: | ||
| - dsr1-fp4-mi355x-atom | ||
| - dsr1-fp4-mi355x-atom-mtp | ||
| - dsr1-fp4-mi355x-sglang | ||
| - dsr1-fp4-mi355x-sglang-disagg | ||
| - dsr1-fp4-mi355x-sglang-disagg-mtp | ||
| - dsr1-fp8-mi355x-sglang | ||
| - dsr1-fp8-mi355x-sglang-disagg | ||
| - dsr1-fp8-mi355x-sglang-disagg-mtp | ||
| - gptoss-fp4-mi355x-atom | ||
| - gptoss-fp4-mi355x-vllm | ||
| - minimaxm2.5-fp8-mi355x-vllm | ||
| description: | ||
| - "Add --exclusive flag to MI355X multi-node sbatch to prevent node sharing during benchmarks" | ||
| - "Only non-TP8 configs are affected; TP8 already uses all GPUs on the node" | ||
| pr-link: TBD | ||
|
|
There was a problem hiding this comment.
🟡 The first changelog entry has two documentation inaccuracies: pr-link: TBD should be https://github.com/SemiAnalysisAI/InferenceX/pull/932, and the entry description ("multi-node sbatch") incorrectly lists 7 single-node configs (dsr1-fp4-mi355x-atom, dsr1-fp4-mi355x-atom-mtp, dsr1-fp4-mi355x-sglang, dsr1-fp8-mi355x-sglang, gptoss-fp4-mi355x-atom, gptoss-fp4-mi355x-vllm, minimaxm2.5-fp8-mi355x-vllm) that route through salloc, not sbatch. The config-keys list should contain only the 4 -disagg variants affected by the submit.sh change; the single-node configs are already covered by the PR #929 entry at the bottom of the file.
Extended reasoning...
Issue 1 — pr-link: TBD placeholder
The first changelog entry sets pr-link: TBD (line 16) instead of the actual PR URL. The project convention visible throughout the file is either a full URL (https://github.com/SemiAnalysisAI/InferenceX/pull/NNN) or the in-progress placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. The bare string TBD is the only such occurrence in the entire file. Since this PR is #932, the correct value is https://github.com/SemiAnalysisAI/InferenceX/pull/932.
One verifier noted that the Pydantic model defines pr_link as a plain str with no URL validation, so TBD does pass schema validation. This is accurate — the bug is not tooling-breaking. However, it is a clear inconsistency with every other entry and a straightforward oversight to fix.
Issue 2 — Single-node configs listed under multi-node sbatch description
The only code change in submit.sh adds --exclusive to the sbatch call in the multi-node disaggregated path. Yet the changelog entry lists 11 config keys, 7 of which are single-node configs that never invoke submit.sh:
dsr1-fp4-mi355x-atom,dsr1-fp4-mi355x-atom-mtp— usesallocinlaunch_mi355x-amds.sh(theelsebranch, confirmed in the diff)dsr1-fp4-mi355x-sglang,dsr1-fp8-mi355x-sglang— likewisesallocin theelsebranchgptoss-fp4-mi355x-atom,gptoss-fp4-mi355x-vllm,minimaxm2.5-fp8-mi355x-vllm— also single-nodesallocpath
Only the 4 -disagg configs (dsr1-fp4-mi355x-sglang-disagg, dsr1-fp4-mi355x-sglang-disagg-mtp, dsr1-fp8-mi355x-sglang-disagg, dsr1-fp8-mi355x-sglang-disagg-mtp) actually route through IS_MULTINODE=true into the sbatch path in submit.sh.
Step-by-step proof:
- In
launch_mi355x-amds.sh, the top-levelif [[ "$IS_MULTINODE" == "true" ]]branch callsbash "benchmarks/multi_node/${SCRIPT_NAME}", which invokessubmit.shcontaining the new--exclusiveflag. - All other configs fall into the
elsebranch which callssalloc --partition=$PARTITION --gres=gpu:$TP --exclusive ...— thissalloc --exclusiveflag was already added by PR [AMD] Fix slurm command for AMD devices #929. - The PR [AMD] Fix slurm command for AMD devices #929 entry at the bottom of
perf-changelog.yamluses"*mi355x*"and"*mi325x*"wildcards, which already match all single-node MI355X configs including every one incorrectly listed in the first entry. - Therefore, including single-node configs in the first entry not only has an inaccurate description but also duplicates coverage already provided by the PR [AMD] Fix slurm command for AMD devices #929 entry.
Impact and fix:
Both issues are documentation-only — the actual --exclusive flag in submit.sh is correctly placed and functional. The fix is: (1) update pr-link: TBD to https://github.com/SemiAnalysisAI/InferenceX/pull/932, and (2) remove the 7 single-node config keys from the first entry, leaving only the 4 -disagg configs.
Summary
--exclusiveflag tosbatchinbenchmarks/multi_node/amd_utils/submit.shto prevent node sharing during multi-node MI355X benchmarkssallocalready has--exclusive; this brings the multi-node path to parityTest plan
🤖 Generated with Claude Code