
Revert "[AMD] Fix slurm command for AMD devices"#933

Merged
cquil11 merged 1 commit into main from revert-929-srok/srun_mi355x_fix
Mar 23, 2026

Conversation

cquil11 (Collaborator) commented Mar 23, 2026

Reverts #929

@cquil11 cquil11 requested a review from a team March 23, 2026 15:17
github-actions (Contributor) commented:

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

@cquil11 cquil11 merged commit cc9c83c into main Mar 23, 2026
3 checks passed
@cquil11 cquil11 deleted the revert-929-srok/srun_mi355x_fix branch March 23, 2026 15:17
Comment on lines 159 to 165
LOCK_FILE="${SQUASH_FILE}.lock"

set -x
salloc --partition=$PARTITION --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"
salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"
JOB_ID=$(squeue --name="$RUNNER_NAME" -h -o %A | head -n1)

srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)"

🔴 Removing --exclusive from the MI355X salloc/sbatch commands re-introduces a destructive node-sharing hazard: the cleanup step (docker stop $(docker ps -a -q)) unconditionally kills all Docker containers on every allocated node, silently terminating other users' workloads when SLURM shares nodes. This affects both the single-node path (launch_mi355x-amds.sh:165) and the multi-node path (job.slurm:333), and is especially likely to trigger for TP<8 configs like minimaxm2.5 (TP=2/TP=4), where a node with 8 GPUs can host multiple concurrent jobs. Re-add --exclusive to both the salloc in launch_mi355x-amds.sh and the sbatch in submit.sh, matching every other runner in the codebase.

Extended reasoning...

What the bug is

This PR reverts #929, which had added --exclusive to the MI355X SLURM allocation commands. Without --exclusive, SLURM is free to assign an MI355X node to multiple simultaneous jobs if there are enough free GPUs. On a node with 8 GPUs, a job requesting only TP=2 or TP=4 GPUs (via --gres=gpu:$TP) will leave 4-6 GPUs free, making the node eligible for co-scheduling with other users' jobs.
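The co-scheduling window is simple arithmetic (a minimal illustration; the 8-GPU node size is stated above, the variable names are ours):

```shell
# With 8 GPUs per node, a TP=2 allocation leaves 6 GPUs free --
# enough headroom for SLURM to pack another job onto the same node
# unless --exclusive is set.
GPUS_PER_NODE=8
TP=2
FREE_GPUS=$((GPUS_PER_NODE - TP))
echo "$FREE_GPUS"   # prints 6
```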

The specific code path that triggers it

In the single-node path (runners/launch_mi355x-amds.sh), line 159 now reads:

salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"

Immediately after, line 165 runs:

srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)"

docker ps -a -q has no --filter for container name, owner, or label - it lists every container on the node. docker stop then terminates all of them.

In the multi-node path, benchmarks/multi_node/amd_utils/submit.sh previously had --exclusive and now does not. job.slurm line 333 runs sudo docker ps -aq | xargs -r sudo docker stop || true on every allocated node, with the same total absence of ownership filtering. Note that line 332 does filter with --filter name=^container_sbatch_ for the docker rm step, making the asymmetry particularly stark: containers are selectively removed but indiscriminately stopped.
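As defense in depth (not a substitute for --exclusive), the stop step could reuse the same name filter the rm step already applies. A hypothetical sketch; the prefix value mirrors the `^container_sbatch_` filter quoted above, and the variable name is ours:

```shell
# Build the cleanup command so that `docker stop` only touches containers
# launched by this job's naming convention, instead of every container on
# the node. The \$ defers command substitution to the allocated node.
CONTAINER_PREFIX="container_sbatch_"
cleanup_cmd="docker stop \$(docker ps -aq --filter name=^${CONTAINER_PREFIX})"
echo "$cleanup_cmd"
```

This would make the stop step symmetric with the existing filtered rm step, but node-level exclusivity remains the actual fix.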

Why existing code does not prevent it

The only safeguard that prevented node sharing was --exclusive. There is no container label, cgroup namespace, or ownership check in the docker stop invocation. Other anti-collision measures (the flock-based squash file serialization, the PORT_OFFSET from RUNNER_NAME) address different problems and do nothing to prevent concurrent jobs from sharing a physical node.

What the impact would be

On any shared cluster where MI355X nodes host multiple jobs, a benchmark run with TP<8 will stop all Docker containers on its allocated node mid-run, without warning or error to the affected users. Their in-flight training, inference, or CI jobs simply disappear. This is data loss and work destruction for other cluster users, not merely a performance issue.

Step-by-step proof

  1. User A submits minimaxm2.5-fp8-mi355x-vllm with TP=2, requesting --gres=gpu:2 on a node with 8 GPUs.
  2. SLURM (without --exclusive) schedules the job on node42, which already has User B running a 6-GPU container workload.
  3. salloc succeeds; User A's script reaches line 165 and executes docker stop against all containers on node42.
  4. docker ps -a -q returns the IDs of User B's containers (there are no User A containers yet at this point).
  5. docker stop terminates User B's containers immediately.
  6. User A is unaware anything went wrong; User B's job fails silently with no notification.

How to fix it

Re-add --exclusive to both places removed by this PR:

  • runners/launch_mi355x-amds.sh: restore --exclusive in the salloc invocation (line ~159).
  • benchmarks/multi_node/amd_utils/submit.sh: restore --exclusive in the sbatch_cmd array.

This matches every other runner in the repository (launch_h200-nv.sh, launch_b200-nv.sh, launch_h100-cw.sh, launch_h200-cw.sh, launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh, launch_b200-dgxc-slurm.sh, launch_h200-nb.sh, launch_b200-nb.sh) which all use --exclusive precisely because they also run indiscriminate docker stop cleanup steps.
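The restored single-node allocation would look like the following sketch (the variable values here are illustrative; in the runner script $PARTITION, $TP, and $RUNNER_NAME are set earlier):

```shell
# Single-node allocation with node exclusivity restored, matching the
# pre-revert launch_mi355x-amds.sh and the other runners in the repo.
PARTITION="mi355x"
TP=2
RUNNER_NAME="mi355x-runner-1"
salloc_cmd="salloc --partition=$PARTITION --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=180 --no-shell --job-name=$RUNNER_NAME"
echo "$salloc_cmd"
```

With --exclusive present, SLURM will not co-schedule other jobs on the node even when TP<8 leaves GPUs idle, so the indiscriminate docker stop cleanup can only affect this job's own containers.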

