fix: add --exclusive to MI355X multi-node sbatch for accurate benchmarks by cquil11 · Pull Request #932 · SemiAnalysisAI/InferenceX

cquil11 · 2026-03-23T14:57:38Z

Summary

Add --exclusive flag to sbatch in benchmarks/multi_node/amd_utils/submit.sh to prevent node sharing during multi-node MI355X benchmarks
Only non-TP8 configs are affected (TP8 already uses all GPUs); perf-changelog updated for all 11 non-TP8 MI355X configs
Single-node MI355X salloc already has --exclusive; this brings the multi-node path to parity

Test plan

Run a non-TP8 MI355X multi-node disagg benchmark (e.g. dsr1-fp8-mi355x-sglang-disagg TP4) and verify results are consistent with exclusive node access

🤖 Generated with Claude Code

Signed-off-by: seungrokj <seungrok.jung@amd.com>

Without --exclusive, SLURM can co-schedule other jobs on the same nodes, causing memory bandwidth, XGMI/Infinity Fabric, and NUMA contention that degrades benchmark results — especially for non-TP8 disaggregated serving configs where cross-node RDMA traffic is latency-sensitive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cquil11 · 2026-03-23T15:00:22Z

Closing in favor of #929 which already covers MI355X.

claude · 2026-03-23T15:20:11Z

perf-changelog.yaml

+- config-keys:
+    - dsr1-fp4-mi355x-atom
+    - dsr1-fp4-mi355x-atom-mtp
+    - dsr1-fp4-mi355x-sglang
+    - dsr1-fp4-mi355x-sglang-disagg
+    - dsr1-fp4-mi355x-sglang-disagg-mtp
+    - dsr1-fp8-mi355x-sglang
+    - dsr1-fp8-mi355x-sglang-disagg
+    - dsr1-fp8-mi355x-sglang-disagg-mtp
+    - gptoss-fp4-mi355x-atom
+    - gptoss-fp4-mi355x-vllm
+    - minimaxm2.5-fp8-mi355x-vllm
+  description:
+    - "Add --exclusive flag to MI355X multi-node sbatch to prevent node sharing during benchmarks"
+    - "Only non-TP8 configs are affected; TP8 already uses all GPUs on the node"
+  pr-link: TBD
+


🟡 The first changelog entry has two documentation inaccuracies: pr-link: TBD should be https://github.com/SemiAnalysisAI/InferenceX/pull/932, and the entry description ("multi-node sbatch") incorrectly lists 7 single-node configs (dsr1-fp4-mi355x-atom, dsr1-fp4-mi355x-atom-mtp, dsr1-fp4-mi355x-sglang, dsr1-fp8-mi355x-sglang, gptoss-fp4-mi355x-atom, gptoss-fp4-mi355x-vllm, minimaxm2.5-fp8-mi355x-vllm) that route through salloc, not sbatch. The config-keys list should contain only the 4 -disagg variants affected by the submit.sh change; the single-node configs are already covered by the PR #929 entry at the bottom of the file.

Extended reasoning...

Issue 1 — pr-link: TBD placeholder

The first changelog entry sets pr-link: TBD (line 16) instead of the actual PR URL. The project convention visible throughout the file is either a full URL (https://github.com/SemiAnalysisAI/InferenceX/pull/NNN) or the in-progress placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. The bare string TBD is the only such occurrence in the entire file. Since this PR is #932, the correct value is https://github.com/SemiAnalysisAI/InferenceX/pull/932.

One verifier noted that the Pydantic model defines pr_link as a plain str with no URL validation, so TBD does pass schema validation. This is accurate — the bug is not tooling-breaking. However, it is a clear inconsistency with every other entry and a straightforward oversight to fix.

Issue 2 — Single-node configs listed under multi-node sbatch description

The only code change in submit.sh adds --exclusive to the sbatch call in the multi-node disaggregated path. Yet the changelog entry lists 11 config keys, 7 of which are single-node configs that never invoke submit.sh:

dsr1-fp4-mi355x-atom, dsr1-fp4-mi355x-atom-mtp — use salloc in launch_mi355x-amds.sh (the else branch, confirmed in the diff)

dsr1-fp4-mi355x-sglang, dsr1-fp8-mi355x-sglang — likewise salloc in the else branch

gptoss-fp4-mi355x-atom, gptoss-fp4-mi355x-vllm, minimaxm2.5-fp8-mi355x-vllm — also single-node salloc path

Only the 4 -disagg configs (dsr1-fp4-mi355x-sglang-disagg, dsr1-fp4-mi355x-sglang-disagg-mtp, dsr1-fp8-mi355x-sglang-disagg, dsr1-fp8-mi355x-sglang-disagg-mtp) actually route through IS_MULTINODE=true into the sbatch path in submit.sh.

Step-by-step proof:

In launch_mi355x-amds.sh, the top-level if [[ "$IS_MULTINODE" == "true" ]] branch calls bash "benchmarks/multi_node/${SCRIPT_NAME}", which invokes submit.sh containing the new --exclusive flag.

All other configs fall into the else branch which calls salloc --partition=$PARTITION --gres=gpu:$TP --exclusive ... — this salloc --exclusive flag was already added by PR [AMD] Fix slurm command for AMD devices #929.

The PR [AMD] Fix slurm command for AMD devices #929 entry at the bottom of perf-changelog.yaml uses "*mi355x*" and "*mi325x*" wildcards, which already match all single-node MI355X configs including every one incorrectly listed in the first entry.

Therefore, including single-node configs in the first entry not only has an inaccurate description but also duplicates coverage already provided by the PR [AMD] Fix slurm command for AMD devices #929 entry.

Impact and fix:

Both issues are documentation-only — the actual --exclusive flag in submit.sh is correctly placed and functional. The fix is: (1) update pr-link: TBD to https://github.com/SemiAnalysisAI/InferenceX/pull/932, and (2) remove the 7 single-node config keys from the first entry, leaving only the 4 -disagg configs.

seungrokj and others added 4 commits March 23, 2026 13:56

slurm command fix

13dd1d4

Signed-off-by: seungrokj <seungrok.jung@amd.com>

update perf changelog

e4f812f

add --exclusive to mi355x multinode

2657489

cquil11 requested a review from a team March 23, 2026 14:57

github-project-automation bot added this to InferenceMAX Board Mar 23, 2026

cquil11 closed this Mar 23, 2026

github-project-automation bot moved this to Done in InferenceMAX Board Mar 23, 2026

claude bot reviewed Mar 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add --exclusive to MI355X multi-node sbatch for accurate benchmarks#932

fix: add --exclusive to MI355X multi-node sbatch for accurate benchmarks#932
cquil11 wants to merge 4 commits intomainfrom
fix/mi355x-exclusive

cquil11 commented Mar 23, 2026

Uh oh!

cquil11 commented Mar 23, 2026

Uh oh!

claude bot Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cquil11 commented Mar 23, 2026

Summary

Test plan

Uh oh!

cquil11 commented Mar 23, 2026

Uh oh!

claude bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants