[AMD] Fix slurm command for AMD devices by seungrokj · Pull Request #929 · SemiAnalysisAI/InferenceX

seungrokj · 2026-03-23T14:00:38Z

Motivation: the AMD Slurm clusters were missing the --exclusive option and showed regressed performance. The NV Slurm clusters already use --exclusive.

without --exclusive
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23433485252

with --exclusive
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23439928217

Regards,
Seungrok

Signed-off-by: seungrokj <seungrok.jung@amd.com>

claude

LGTM. Straightforward addition of --exclusive to salloc across all three AMD runner scripts, consistent with the existing NV cluster scripts.

Extended reasoning...

Overview

This PR modifies three AMD Slurm launch scripts (launch_mi300x-amds.sh, launch_mi325x-amd.sh, launch_mi355x-amds.sh) to add --exclusive to their salloc calls. The change is identical in nature across all three files: one flag inserted into an existing command line.
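For readers unfamiliar with the flag, the change described above can be sketched roughly as follows. This is an illustration only: the partition name, node count, and time limit are hypothetical placeholders, and the real arguments in the runner scripts are not shown in this thread. The function prints the command rather than running it, so it works on a machine without Slurm.

```shell
# Sketch only: partition and time limit are hypothetical placeholders,
# not values taken from the actual runner scripts.

# Print a salloc command line that requests one node with --exclusive,
# so the allocated node is not shared with other jobs during the benchmark.
salloc_cmd() {
  # $1 = partition name, $2 = time limit in minutes
  printf 'salloc --partition=%s --nodes=1 --exclusive --time=%s --no-shell\n' "$1" "$2"
}

salloc_cmd example-partition 180
```

Without --exclusive, whether a node is shared depends on how the cluster's partitions are configured, which may explain why the regression was only visible on some clusters.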

Security Risks

No security-sensitive code is touched. These are CI runner scripts for benchmark execution on dedicated hardware. No auth, crypto, permissions, or external service calls are modified.

Level of Scrutiny

This is a low-risk, mechanical change. The --exclusive flag is already used in the NV cluster scripts, and the PR provides clear benchmark evidence of a significant performance improvement on MI355X (TPOT of ~22 ms with --exclusive versus ~38 ms without). The change is consistent with established patterns in the codebase.
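To put the two TPOT figures in relative terms, here is a small calculation. It assumes ~38 ms is the run without --exclusive and ~22 ms the run with it, as the "improvement" wording implies; both inputs are the approximate values quoted above, so the result is only indicative.

```shell
# Sketch only: inputs are the approximate TPOT values quoted in the review.

# Print the relative TPOT reduction and the per-token speedup.
tpot_change() {
  # $1 = slower TPOT in ms, $2 = faster TPOT in ms
  awk -v slow="$1" -v fast="$2" 'BEGIN {
    printf "reduction=%.1f%% speedup=%.2fx\n", (slow - fast) / slow * 100, slow / fast
  }'
}

tpot_change 38 22   # prints: reduction=42.1% speedup=1.73x
```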

Other Factors

The flagged bugs (missing -u $USER in squeue and no empty JOB_ID guard in the mi355x single-node path) are pre-existing issues not introduced by this PR. They are noted as inline comments for the team to address separately. The core --exclusive addition is correct and self-contained.
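The two pre-existing issues could be addressed along these lines. This is a sketch under assumptions, not the scripts' actual code: the function names and the job-name lookup are hypothetical, and the only facts taken from the review are that squeue lacks -u $USER and that JOB_ID is not checked for being empty.

```shell
# Sketch only: function names and the lookup-by-job-name approach are
# hypothetical; they are not taken from the runner scripts.

# Look up a job ID by job name, restricted to the current user's jobs.
# -u filters by user, -n by job name, -h drops the header, -o %i prints the ID.
lookup_job_id() {
  squeue -u "$USER" -n "$1" -h -o %i | head -n 1
}

# Fail with a message when the job ID is empty, so later steps never run
# against a missing allocation.
require_job_id() {
  if [ -z "$1" ]; then
    echo "error: no Slurm job ID found" >&2
    return 1
  fi
}

# Usage (commented out because it needs a Slurm cluster):
# JOB_ID=$(lookup_job_id "example-job-name")
# require_job_id "$JOB_ID" || exit 1
```

Restricting squeue to the current user avoids picking up another user's job that happens to match, and the guard turns a silent empty value into an explicit failure.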

runners/launch_mi355x-amds.sh

cquil11

LGTM

cquil11 · 2026-03-23T15:02:52Z

see also:
MI325X: #931
MI300X (pending hardware): #930

Re-opening of #929 which was reverted. Add --exclusive flag to MI355X salloc (single-node) and sbatch (multi-node) to prevent node sharing during benchmarks, matching NVIDIA runner behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [AMD] Fix slurm command for AMD devices

  Re-opening of #929 which was reverted. Add --exclusive flag to MI355X salloc (single-node) and sbatch (multi-node) to prevent node sharing during benchmarks, matching NVIDIA runner behavior.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update perf-changelog.yaml

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

slurm command fix

13dd1d4

Signed-off-by: seungrokj <seungrok.jung@amd.com>

seungrokj requested a review from a team March 23, 2026 14:00

github-project-automation bot added this to InferenceMAX Board Mar 23, 2026

claude bot reviewed Mar 23, 2026


runners/launch_mi355x-amds.sh (resolved)

update perf changelog

e4f812f

cquil11 added the sweep-enabled label Mar 23, 2026

cquil11 added 2 commits March 23, 2026 09:21

add --exclusive to mi355x multinode

2657489

Merge branch 'main' into srok/srun_mi355x_fix

4a0d50d

cquil11 approved these changes Mar 23, 2026


cquil11 mentioned this pull request Mar 23, 2026

fix: add --exclusive to MI355X multi-node sbatch for accurate benchmarks #932

Closed

1 task

cquil11 and others added 4 commits March 23, 2026 10:04

fix: scope PR to MI355X only, remove MI325X changes

c8eae19

MI325X now has its own PR (#931). Update perf-changelog to list specific non-TP8 MI355X configs instead of wildcards. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: clarify perf-changelog description for MI355X --exclusive

60b418c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update perf-changelog.yaml

7650e06

Merge branch 'main' into srok/srun_mi355x_fix

bca1293

cquil11 merged commit d5a013b into main Mar 23, 2026

cquil11 deleted the srok/srun_mi355x_fix branch March 23, 2026 15:10

github-project-automation bot moved this to Done in InferenceMAX Board Mar 23, 2026

cquil11 added a commit that referenced this pull request Mar 23, 2026

Revert "[AMD] Fix slurm command for AMD devices (#929)"

4958ff8

This reverts commit d5a013b.

cquil11 mentioned this pull request Mar 23, 2026

Revert "[AMD] Fix slurm command for AMD devices" #933

Merged

cquil11 added a commit that referenced this pull request Mar 23, 2026

Revert "[AMD] Fix slurm command for AMD devices (#929)" (#933) [skip-…

cc9c83c

…sweep] This reverts commit d5a013b.

cquil11 mentioned this pull request Mar 23, 2026

[AMD] Fix slurm command for AMD devices #934

Merged

cquil11 merged 8 commits into main from srok/srun_mi355x_fix
