Skip to content

fix: add --exclusive to MI355X multi-node sbatch for accurate benchmarks#932

Closed
cquil11 wants to merge 4 commits intomainfrom
fix/mi355x-exclusive
Closed

fix: add --exclusive to MI355X multi-node sbatch for accurate benchmarks#932
cquil11 wants to merge 4 commits intomainfrom
fix/mi355x-exclusive

Conversation

@cquil11
Copy link
Collaborator

@cquil11 cquil11 commented Mar 23, 2026

Summary

  • Add --exclusive flag to sbatch in benchmarks/multi_node/amd_utils/submit.sh to prevent node sharing during multi-node MI355X benchmarks
  • Only non-TP8 configs are affected (TP8 already uses all GPUs); perf-changelog updated for all 11 non-TP8 MI355X configs
  • Single-node MI355X salloc already has --exclusive; this brings the multi-node path to parity

Test plan

  • Run a non-TP8 MI355X multi-node disagg benchmark (e.g. dsr1-fp8-mi355x-sglang-disagg TP4) and verify results are consistent with exclusive node access

🤖 Generated with Claude Code

seungrokj and others added 4 commits March 23, 2026 13:56
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Without --exclusive, SLURM can co-schedule other jobs on the same nodes,
causing memory bandwidth, XGMI/Infinity Fabric, and NUMA contention that
degrades benchmark results — especially for non-TP8 disaggregated serving
configs where cross-node RDMA traffic is latency-sensitive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cquil11 cquil11 requested a review from a team March 23, 2026 14:57
@cquil11
Copy link
Collaborator Author

cquil11 commented Mar 23, 2026

Closing in favor of #929 which already covers MI355X.

Comment on lines +1 to +17
- config-keys:
- dsr1-fp4-mi355x-atom
- dsr1-fp4-mi355x-atom-mtp
- dsr1-fp4-mi355x-sglang
- dsr1-fp4-mi355x-sglang-disagg
- dsr1-fp4-mi355x-sglang-disagg-mtp
- dsr1-fp8-mi355x-sglang
- dsr1-fp8-mi355x-sglang-disagg
- dsr1-fp8-mi355x-sglang-disagg-mtp
- gptoss-fp4-mi355x-atom
- gptoss-fp4-mi355x-vllm
- minimaxm2.5-fp8-mi355x-vllm
description:
- "Add --exclusive flag to MI355X multi-node sbatch to prevent node sharing during benchmarks"
- "Only non-TP8 configs are affected; TP8 already uses all GPUs on the node"
pr-link: TBD

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The first changelog entry has two documentation inaccuracies: pr-link: TBD should be https://github.com/SemiAnalysisAI/InferenceX/pull/932, and the entry description ("multi-node sbatch") incorrectly lists 7 single-node configs (dsr1-fp4-mi355x-atom, dsr1-fp4-mi355x-atom-mtp, dsr1-fp4-mi355x-sglang, dsr1-fp8-mi355x-sglang, gptoss-fp4-mi355x-atom, gptoss-fp4-mi355x-vllm, minimaxm2.5-fp8-mi355x-vllm) that route through salloc, not sbatch. The config-keys list should contain only the 4 -disagg variants affected by the submit.sh change; the single-node configs are already covered by the PR #929 entry at the bottom of the file.

Extended reasoning...

Issue 1 — pr-link: TBD placeholder

The first changelog entry sets pr-link: TBD (line 16) instead of the actual PR URL. The project convention visible throughout the file is either a full URL (https://github.com/SemiAnalysisAI/InferenceX/pull/NNN) or the in-progress placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. The bare string TBD is the only such occurrence in the entire file. Since this PR is #932, the correct value is https://github.com/SemiAnalysisAI/InferenceX/pull/932.

One verifier noted that the Pydantic model defines pr_link as a plain str with no URL validation, so TBD does pass schema validation. This is accurate — the bug is not tooling-breaking. However, it is a clear inconsistency with every other entry and a straightforward oversight to fix.

Issue 2 — Single-node configs listed under multi-node sbatch description

The only code change in submit.sh adds --exclusive to the sbatch call in the multi-node disaggregated path. Yet the changelog entry lists 11 config keys, 7 of which are single-node configs that never invoke submit.sh:

  • dsr1-fp4-mi355x-atom, dsr1-fp4-mi355x-atom-mtp — use salloc in launch_mi355x-amds.sh (the else branch, confirmed in the diff)
  • dsr1-fp4-mi355x-sglang, dsr1-fp8-mi355x-sglang — likewise salloc in the else branch
  • gptoss-fp4-mi355x-atom, gptoss-fp4-mi355x-vllm, minimaxm2.5-fp8-mi355x-vllm — also single-node salloc path

Only the 4 -disagg configs (dsr1-fp4-mi355x-sglang-disagg, dsr1-fp4-mi355x-sglang-disagg-mtp, dsr1-fp8-mi355x-sglang-disagg, dsr1-fp8-mi355x-sglang-disagg-mtp) actually route through IS_MULTINODE=true into the sbatch path in submit.sh.

Step-by-step proof:

  1. In launch_mi355x-amds.sh, the top-level if [[ "$IS_MULTINODE" == "true" ]] branch calls bash "benchmarks/multi_node/${SCRIPT_NAME}", which invokes submit.sh containing the new --exclusive flag.
  2. All other configs fall into the else branch which calls salloc --partition=$PARTITION --gres=gpu:$TP --exclusive ... — this salloc --exclusive flag was already added by PR [AMD] Fix slurm command for AMD devices #929.
  3. The PR [AMD] Fix slurm command for AMD devices #929 entry at the bottom of perf-changelog.yaml uses "*mi355x*" and "*mi325x*" wildcards, which already match all single-node MI355X configs including every one incorrectly listed in the first entry.
  4. Therefore, including single-node configs in the first entry not only has an inaccurate description but also duplicates coverage already provided by the PR [AMD] Fix slurm command for AMD devices #929 entry.

Impact and fix:

Both issues are documentation-only — the actual --exclusive flag in submit.sh is correctly placed and functional. The fix is: (1) update pr-link: TBD to https://github.com/SemiAnalysisAI/InferenceX/pull/932, and (2) remove the 7 single-node config keys from the first entry, leaving only the 4 -disagg configs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

2 participants