Skip to content

[AMD] Fix slurm command for AMD devices#934

Merged
cquil11 merged 2 commits intomainfrom
srok/srun_mi355x_fix_v2
Mar 23, 2026
Merged

[AMD] Fix slurm command for AMD devices#934
cquil11 merged 2 commits intomainfrom
srok/srun_mi355x_fix_v2

Conversation

@cquil11
Copy link
Collaborator

@cquil11 cquil11 commented Mar 23, 2026

Note: Re-opening of #929 which was reverted.

Hi @cquil11 @functionstackx

Motivation is AMD slurm clusters are missing --exclusive option and they showed regressed performance. NV slurm clusters are already using --exclusive option.

without --exclusive
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23433485252
image

with --exclusive
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23439928217
image

Regards,
Seungrok

Re-opening of #929 which was reverted.

Add --exclusive flag to MI355X salloc (single-node) and sbatch
(multi-node) to prevent node sharing during benchmarks, matching
NVIDIA runner behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cquil11 cquil11 requested a review from a team March 23, 2026 15:19
@github-actions
Copy link
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

@cquil11
Copy link
Collaborator Author

cquil11 commented Mar 23, 2026

@seungrokj had to re open

@cquil11 cquil11 merged commit 42eb961 into main Mar 23, 2026
13 checks passed
@cquil11 cquil11 deleted the srok/srun_mi355x_fix_v2 branch March 23, 2026 15:21

set -x
salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"
salloc --partition=$PARTITION --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 This PR fixes --exclusive for MI355X salloc but leaves MI325X (runners/launch_mi325x-amd.sh) and MI300X (runners/launch_mi300x-amds.sh) without the flag, making the fix incomplete relative to the PR title "[AMD] Fix slurm command for AMD devices". The already-merged perf-changelog from PR #931 explicitly documents that --exclusive on MI325X salloc prevents node sharing, but the corresponding code change was never applied, creating a contradiction between documented behavior and actual behavior.

Extended reasoning...

What the bug is and how it manifests

This PR re-adds --exclusive to runners/launch_mi355x-amds.sh line 162 and benchmarks/multi_node/amd_utils/submit.sh, but the analogous fix is absent from runners/launch_mi325x-amd.sh and runners/launch_mi300x-amds.sh. Both of those files still contain salloc invocations without --exclusive, meaning benchmarks on MI325X and MI300X can share nodes with other jobs, leading to the same performance regression the PR title claims to fix.

The specific code path that triggers it

For MI325X single-node benchmarks, the salloc on line 12 of runners/launch_mi325x-amd.sh runs without --exclusive. The SLURM scheduler can therefore co-schedule other workloads on the same node during the benchmark run, introducing unpredictable contention for GPUs, memory bandwidth, and NIC resources. MI300X is in the same situation. Neither runner has an IS_MULTINODE branch that would receive the sbatch --exclusive fix from submit.sh.

Why existing code does not prevent it

The git history makes the gap clear: PR #929 added --exclusive to all AMD runners; PR #933 reverted it entirely. PR #931 (already merged and not reverted) added a perf-changelog entry stating "Document --exclusive flag on MI325X salloc prevents node sharing during benchmarks" and "May yield performance improvements" - but that PR was a chore commit that only touched perf-changelog.yaml, never the runner scripts. This PR (#934) re-adds the MI355X piece but omits MI325X and MI300X, so the gap between documentation and code persists.

Addressing the refutation

One verifier argues the omission is intentional, citing a commit from PR #929 that said "scope PR to MI355X only, remove MI325X changes. MI325X now has its own PR (#931)." However, PR #931 never added --exclusive to launch_mi325x-amd.sh - it was a documentation-only chore PR. The intent to fix MI325X in a dedicated PR was stated, but the implementation was never delivered. As a result, the merged perf-changelog now describes behavior that does not exist in the codebase, and this PR (which is explicitly a re-open of #929 with the goal of fixing AMD slurm clusters) is a natural place to complete the work.

Impact

The PR description shows benchmark screenshots demonstrating measurable performance regression without --exclusive. The perf-changelog for PR #931 documents "May yield performance improvements" specifically for gptoss-fp4-mi325x-vllm and minimaxm2.5-fp8-mi325x-vllm. Leaving MI325X and MI300X without --exclusive means those benchmarks may continue to run under node-sharing conditions, producing results that are neither reproducible nor comparable to baselines.

Step-by-step proof

  1. PR chore: add perf-changelog for MI325X --exclusive flag impact #931 merged commit 7468fb3: perf-changelog.yaml gains entry claiming --exclusive on MI325X salloc prevents node sharing.
  2. The actual launch_mi325x-amd.sh line 12 still reads: salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=480 --no-shell --job-name="$RUNNER_NAME" - no --exclusive.
  3. This PR adds --exclusive to launch_mi355x-amds.sh line 162 (visible in the diff) but does not touch launch_mi325x-amd.sh or launch_mi300x-amds.sh.
  4. Any MI325X single-node benchmark run via launch_mi325x-amd.sh is still susceptible to SLURM node sharing, contradicting both the PR title and the merged perf-changelog.

How to fix it

Add --exclusive to the salloc invocation in runners/launch_mi325x-amd.sh and runners/launch_mi300x-amds.sh, mirroring the change already made in runners/launch_mi355x-amds.sh.

Comment on lines +1038 to +1041
- gptoss-fp4-mi355x-vllm
- minimaxm2.5-fp8-mi355x-vllm
description:
- "Add --exclusive flag to MI355X single-node salloc and multi-node sbatch to prevent node sharing during benchmarks"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The perf-changelog description "Only non-TP8 configs listed; TP8 already uses all GPUs on the node" is factually inaccurate — several listed configs are TP=8 (e.g., dsr1-fp4-mi355x-atom was explicitly consolidated to TP=8-only in PR #699). The prior PR #931 entry correctly said "Only non-TP8 configs are affected", but changing "affected" to "listed" falsely implies TP8 configs were excluded from the list.

Extended reasoning...

What the Bug Is

The new perf-changelog entry describes the --exclusive flag change with: "Only non-TP8 configs listed; TP8 already uses all GPUs on the node." This differs meaningfully from the PR #931 entry it was based on, which read: "Only non-TP8 configs are affected; TP8 already uses all GPUs on the node."

The Specific Issue

The difference is one word: "affected" (PR #931) vs "listed" (this PR). In PR #931, "affected" correctly meant the --exclusive flag only has a performance impact for non-TP8 sub-configurations, since TP8 already occupies all 8 GPUs making --exclusive redundant for them. "Listed" instead implies only non-TP8 config-keys appear in this entry, which is false.

Why the Description Is Inaccurate

The listed config-keys clearly include TP=8 configurations:

  • dsr1-fp4-mi355x-atom and dsr1-fp4-mi355x-atom-mtp: PR [AMD] Update search-space for ATOM DSR1 configs #699 perf-changelog entry explicitly states "Comment out TP=4 configs, consolidate to TP=8 only" — this config has no non-TP8 variants whatsoever.
  • dsr1-fp8-mi355x-sglang: Search space includes tp:8 entries for 1k1k and 1k8k.
  • dsr1-fp4-mi355x-sglang: Has both tp:4 and tp:8 entries.
  • gptoss-fp4-mi355x-atom, gptoss-fp4-mi355x-vllm: Include tp:8 in their search spaces.

One refutation argued that all listed configs have non-TP8 variants and the description is just imprecisely worded. However, this fails for dsr1-fp4-mi355x-atom which was explicitly consolidated to TP=8 only with no non-TP8 variants remaining. Furthermore, "Only non-TP8 configs listed" in plain English means "the listed configs are non-TP8," not "configs that happen to have some non-TP8 sub-configurations."

Step-by-Step Proof

  1. PR chore: add perf-changelog for MI325X --exclusive flag impact #931 uses: "Only non-TP8 configs are affected" — correct, --exclusive helps only non-TP8 runs.
  2. This PR copies that language but changes "affected" to "listed," making it read: "Only non-TP8 configs [are] listed."
  3. Check dsr1-fp4-mi355x-atom: PR [AMD] Update search-space for ATOM DSR1 configs #699 entry says "Comment out TP=4 configs, consolidate to TP=8 only" — TP=8-only, no non-TP8 variants.
  4. dsr1-fp4-mi355x-atom appears in the new entry's config-keys list.
  5. Conclusion: a TP=8-only config is listed, directly contradicting "only non-TP8 configs listed."

Impact

Documentation-only inaccuracy with no functional impact on the correct --exclusive flag change. However, readers consulting the perf-changelog to understand which configurations benefit from --exclusive would be misled into thinking TP=8 configs were intentionally excluded.

How to Fix

Change "Only non-TP8 configs listed" back to "Only non-TP8 configs are affected" (matching PR #931 intent), making clear this notes that TP8 does not benefit from --exclusive — not that TP8 configs are absent from the list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

2 participants