
Revert "[AMD] Fix slurm command for AMD devices"#933

Merged
cquil11 merged 1 commit into main from revert-929-srok/srun_mi355x_fix
Mar 23, 2026

Conversation

cquil11 (Collaborator) commented Mar 23, 2026

Reverts #929

@cquil11 cquil11 requested a review from a team March 23, 2026 15:17
github-actions (Contributor) commented:

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

@cquil11 cquil11 merged commit cc9c83c into main Mar 23, 2026
3 checks passed
@cquil11 cquil11 deleted the revert-929-srok/srun_mi355x_fix branch March 23, 2026 15:17
Comment on lines 159 to 165
LOCK_FILE="${SQUASH_FILE}.lock"

set -x
salloc --partition=$PARTITION --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"
salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"
JOB_ID=$(squeue --name="$RUNNER_NAME" -h -o %A | head -n1)

srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)"

🔴 Removing --exclusive from the MI355X salloc/sbatch commands re-introduces a destructive node-sharing hazard: the cleanup step (docker stop $(docker ps -a -q)) unconditionally kills all Docker containers on every allocated node, silently terminating other users' workloads when SLURM shares nodes. This affects both the single-node path (launch_mi355x-amds.sh:165) and the multi-node path (job.slurm:333), and is especially likely to trigger for TP<8 configs like minimaxm2.5 (TP=2/TP=4), where a node with 8 GPUs can host multiple concurrent jobs. Re-add --exclusive to both the salloc in launch_mi355x-amds.sh and the sbatch in submit.sh, matching every other runner in the codebase.

Extended reasoning...

What the bug is

This PR reverts #929, which had added --exclusive to the MI355X SLURM allocation commands. Without --exclusive, SLURM is free to assign an MI355X node to multiple simultaneous jobs if there are enough free GPUs. On a node with 8 GPUs, a job requesting only TP=2 or TP=4 GPUs (via --gres=gpu:$TP) will leave 4-6 GPUs free, making the node eligible for co-scheduling with other users' jobs.
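The co-scheduling window is simple arithmetic (a minimal illustration; the 8-GPU node size is stated above, the variable names are ours):

```shell
# With 8 GPUs per node, a TP=2 allocation leaves 6 GPUs free --
# enough headroom for SLURM to pack another job onto the same node
# unless --exclusive is set.
GPUS_PER_NODE=8
TP=2
FREE_GPUS=$((GPUS_PER_NODE - TP))
echo "$FREE_GPUS"   # prints 6
```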

The specific code path that triggers it

In the single-node path (runners/launch_mi355x-amds.sh), line 159 now reads:

salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"

Immediately after, line 165 runs:

srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)"

docker ps -a -q has no --filter for container name, owner, or label - it lists every container on the node. docker stop then terminates all of them.

In the multi-node path, benchmarks/multi_node/amd_utils/submit.sh previously had --exclusive and now does not. job.slurm line 333 runs sudo docker ps -aq | xargs -r sudo docker stop || true on every allocated node, with the same total absence of ownership filtering. Note that line 332 does filter with --filter name=^container_sbatch_ for the docker rm step, making the asymmetry particularly stark: containers are selectively removed but indiscriminately stopped.
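As defense in depth (not a substitute for --exclusive), the stop step could reuse the same name filter the rm step already applies. A hypothetical sketch; the prefix value mirrors the `^container_sbatch_` filter quoted above, and the variable name is ours:

```shell
# Build the cleanup command so that `docker stop` only touches containers
# launched by this job's naming convention, instead of every container on
# the node. The \$ defers command substitution to the allocated node.
CONTAINER_PREFIX="container_sbatch_"
cleanup_cmd="docker stop \$(docker ps -aq --filter name=^${CONTAINER_PREFIX})"
echo "$cleanup_cmd"
```

This would make the stop step symmetric with the existing filtered rm step, but node-level exclusivity remains the actual fix.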

Why existing code does not prevent it

The only safeguard that prevented node sharing was --exclusive. There is no container label, cgroup namespace, or ownership check in the docker stop invocation. Other anti-collision measures (the flock-based squash file serialization, the PORT_OFFSET from RUNNER_NAME) address different problems and do nothing to prevent concurrent jobs from sharing a physical node.

What the impact would be

On any shared cluster where MI355X nodes host multiple jobs, a benchmark run with TP<8 will stop all Docker containers on its allocated node mid-run, without warning or error to the affected users. Their in-flight training, inference, or CI jobs simply disappear. This is data loss and work destruction for other cluster users, not merely a performance issue.

Step-by-step proof

  1. User A submits minimaxm2.5-fp8-mi355x-vllm with TP=2, requesting --gres=gpu:2 on a node with 8 GPUs.
  2. SLURM (without --exclusive) schedules the job on node42, which already has User B running a 6-GPU container workload.
  3. salloc succeeds; User A's script reaches line 165 and executes docker stop against all containers on node42.
  4. docker ps -a -q returns the IDs of User B's containers (there are no User A containers yet at this point).
  5. docker stop terminates User B's containers immediately.
  6. User A is unaware anything went wrong; User B's job fails silently with no notification.

How to fix it

Re-add --exclusive to both places removed by this PR:

  • runners/launch_mi355x-amds.sh: restore --exclusive in the salloc invocation (line ~159).
  • benchmarks/multi_node/amd_utils/submit.sh: restore --exclusive in the sbatch_cmd array.

This matches every other runner in the repository (launch_h200-nv.sh, launch_b200-nv.sh, launch_h100-cw.sh, launch_h200-cw.sh, launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh, launch_b200-dgxc-slurm.sh, launch_h200-nb.sh, launch_b200-nb.sh) which all use --exclusive precisely because they also run indiscriminate docker stop cleanup steps.
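The restored single-node allocation would look like the following sketch (the variable values here are illustrative; in the runner script $PARTITION, $TP, and $RUNNER_NAME are set earlier):

```shell
# Single-node allocation with node exclusivity restored, matching the
# pre-revert launch_mi355x-amds.sh and the other runners in the repo.
PARTITION="mi355x"
TP=2
RUNNER_NAME="mi355x-runner-1"
salloc_cmd="salloc --partition=$PARTITION --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=180 --no-shell --job-name=$RUNNER_NAME"
echo "$salloc_cmd"
```

With --exclusive present, SLURM will not co-schedule other jobs on the node even when TP<8 leaves GPUs idle, so the indiscriminate docker stop cleanup can only affect this job's own containers.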

