Skip to content

[AMD] Fix slurm command for AMD devices#929

Merged
cquil11 merged 8 commits intomainfrom
srok/srun_mi355x_fix
Mar 23, 2026
Merged

[AMD] Fix slurm command for AMD devices#929
cquil11 merged 8 commits intomainfrom
srok/srun_mi355x_fix

Conversation

@seungrokj
Copy link
Collaborator

@seungrokj seungrokj commented Mar 23, 2026

Hi @cquil11 @functionstackx

Motivation is AMD slurm clusters are missing --exclusive option and they showed regressed performance. NV slurm clusters are already using --exclusive option.

without --exclusive
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23433485252
image

with --exclusive
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23439928217
image

Regards,
Seungrok

Signed-off-by: seungrokj <seungrok.jung@amd.com>
Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — straightforward addition of --exclusive to salloc across all three AMD runner scripts, consistent with the existing NV cluster scripts.

Extended reasoning...

Overview

This PR modifies three AMD Slurm launch scripts (launch_mi300x-amds.sh, launch_mi325x-amd.sh, launch_mi355x-amds.sh) to add --exclusive to their salloc calls. The change is identical in nature across all three files: one flag inserted into an existing command line.

Security Risks

No security-sensitive code is touched. These are CI runner scripts for benchmark execution on dedicated hardware. No auth, crypto, permissions, or external service calls are modified.

Level of Scrutiny

This is a low-risk, mechanical change. The --exclusive flag is already used in the NV cluster scripts, and the PR provides clear benchmark evidence showing significant performance improvement (TPOT ~22ms vs ~38ms with --exclusive on MI355X). The change is consistent with established patterns in the codebase.

Other Factors

The flagged bugs (missing -u $USER in squeue and no empty JOB_ID guard in the mi355x single-node path) are pre-existing issues not introduced by this PR. They are noted as inline comments for the team to address separately. The core --exclusive addition is correct and self-contained.

Copy link
Collaborator

@cquil11 cquil11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cquil11
Copy link
Collaborator

cquil11 commented Mar 23, 2026

see also:
MI325X: #931
MI300X (pending hardware): #930

cquil11 and others added 4 commits March 23, 2026 10:04
MI325X now has its own PR (#931). Update perf-changelog to list
specific non-TP8 MI355X configs instead of wildcards.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cquil11 cquil11 merged commit d5a013b into main Mar 23, 2026
@cquil11 cquil11 deleted the srok/srun_mi355x_fix branch March 23, 2026 15:10
cquil11 added a commit that referenced this pull request Mar 23, 2026
cquil11 added a commit that referenced this pull request Mar 23, 2026
cquil11 added a commit that referenced this pull request Mar 23, 2026
Re-opening of #929 which was reverted.

Add --exclusive flag to MI355X salloc (single-node) and sbatch
(multi-node) to prevent node sharing during benchmarks, matching
NVIDIA runner behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cquil11 added a commit that referenced this pull request Mar 23, 2026
* [AMD] Fix slurm command for AMD devices

Re-opening of #929 which was reverted.

Add --exclusive flag to MI355X salloc (single-node) and sbatch
(multi-node) to prevent node sharing during benchmarks, matching
NVIDIA runner behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update perf-changelog.yaml

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

2 participants