Revert "[AMD] Fix slurm command for AMD devices" #933
Conversation
This reverts commit d5a013b.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
```diff
  LOCK_FILE="${SQUASH_FILE}.lock"

  set -x
- salloc --partition=$PARTITION --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"
+ salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"
  JOB_ID=$(squeue --name="$RUNNER_NAME" -h -o %A | head -n1)

  srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)"
```
🔴 Removing --exclusive from the MI355X salloc/sbatch commands re-introduces a destructive node-sharing hazard: the cleanup step (docker stop $(docker ps -a -q)) unconditionally kills all Docker containers on every allocated node, silently terminating other users' workloads when SLURM shares nodes. This affects both the single-node path (launch_mi355x-amds.sh:165) and the multi-node path (job.slurm:333), and is especially likely to trigger for TP<8 configs like minimaxm2.5 (TP=2/TP=4), where a node with 8 GPUs can host multiple concurrent jobs. Re-add --exclusive to both the salloc in launch_mi355x-amds.sh and the sbatch in submit.sh, matching every other runner in the codebase.
Extended reasoning...
What the bug is
This PR reverts #929, which had added --exclusive to the MI355X SLURM allocation commands. Without --exclusive, SLURM is free to assign an MI355X node to multiple simultaneous jobs if there are enough free GPUs. On a node with 8 GPUs, a job requesting only TP=2 or TP=4 GPUs (via --gres=gpu:$TP) leaves 4-6 GPUs free, making the node eligible for co-scheduling with other users' jobs.
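The co-scheduling arithmetic can be sketched with hypothetical numbers (GPUS_PER_NODE and TP here are illustrative shell variables, not SLURM settings):

```shell
# Illustrative only: a TP=2 job on an 8-GPU node leaves 6 GPUs free,
# so without --exclusive SLURM may pack other jobs onto the same node.
GPUS_PER_NODE=8
TP=2
FREE=$((GPUS_PER_NODE - TP))
echo "$FREE"   # 6 GPUs left for co-scheduled jobs
```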
The specific code path that triggers it
In the single-node path (runners/launch_mi355x-amds.sh), line 159 now reads:
salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"
Immediately after, line 165 runs:
srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)"
docker ps -a -q has no --filter for container name, owner, or label: it lists every container on the node. docker stop then terminates all of them.
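As an aside on the quoting in that srun line: the backslash in `\$(docker ps -a -q)` controls where the command substitution runs. Unescaped, `$( )` expands on the submitting shell; escaped, it expands inside the inner bash, i.e. on the remote node. A stand-in sketch (no SLURM or Docker needed; `where` is a hypothetical marker function):

```shell
# where() stands in for `docker ps -a -q`; the point is which shell runs it.
where() { echo "outer"; }

# Unescaped: $(where) expands in the current shell, BEFORE bash -c runs.
early=$(bash -c "echo $(where)")

# Escaped: the inner bash receives the literal $(where) and expands it itself.
late=$(bash -c "where() { echo inner; }; echo \$(where)")

echo "$early"   # outer  (expanded on the "submit host")
echo "$late"    # inner  (expanded on the "remote node")
```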
In the multi-node path, benchmarks/multi_node/amd_utils/submit.sh previously had --exclusive and now does not. job.slurm line 333 runs sudo docker ps -aq | xargs -r sudo docker stop || true on every allocated node, with the same total absence of ownership filtering. Note that line 332 does filter with --filter name=^container_sbatch_ for the docker rm step, making the asymmetry particularly stark: containers are selectively removed but indiscriminately stopped.
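The name filter on the docker rm step behaves like an anchored match on container names. An awk stand-in over a fabricated `docker ps`-style listing (IDs and names are made up) shows what `--filter name=^container_sbatch_` selects and what the unfiltered stop would hit:

```shell
# Fabricated "ID NAME" listing standing in for docker ps output.
ps_output='abc123 container_sbatch_42
def456 user_b_training
789abc container_sbatch_7'

# Anchored name filter, analogous to --filter name=^container_sbatch_ :
matched=$(printf '%s\n' "$ps_output" | awk '$2 ~ /^container_sbatch_/ {print $1}')

echo "$matched"   # only the sbatch-owned IDs; user_b_training is untouched
```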
Why existing code does not prevent it
The only safeguard that prevented node sharing was --exclusive. There is no container label, cgroup namespace, or ownership check in the docker stop invocation. Other anti-collision measures (the flock-based squash file serialization, the PORT_OFFSET from RUNNER_NAME) address different problems and do nothing to prevent concurrent jobs from sharing a physical node.
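The flock-based serialization mentioned above looks roughly like this (a sketch with a hypothetical lock path; assumes the util-linux flock utility is installed):

```shell
# Serialize a critical section across processes via an exclusive file lock.
LOCK_FILE="/tmp/demo_squash.lock"   # hypothetical path for illustration

result=$(
  (
    flock -x 9                      # block until fd 9 holds the exclusive lock
    echo "critical section"         # work that must not run concurrently
  ) 9>"$LOCK_FILE"                  # lock released when fd 9 closes
)
echo "$result"
```

Note this only serializes access to the squash file between cooperating processes on one host; it does nothing to stop SLURM from placing a second job on the node.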
What the impact would be
On any shared cluster where MI355X nodes host multiple jobs, a benchmark run with TP<8 will stop all Docker containers on its allocated node mid-run, without warning or error to the affected users. Their in-flight training, inference, or CI jobs simply disappear. This is data loss and work destruction for other cluster users, not merely a performance issue.
Step-by-step proof
- User A submits minimaxm2.5-fp8-mi355x-vllm with TP=2, requesting --gres=gpu:2 on a node with 8 GPUs.
- SLURM (without --exclusive) schedules the job on node42 which already has User B running a 6-GPU container workload.
- salloc succeeds; User A's script reaches line 165 and executes docker stop against all containers on node42.
- docker ps -a -q returns the IDs of User B's containers (User A has no containers yet at this point).
- docker stop terminates User B's containers immediately.
- User A is unaware anything went wrong; User B's job fails silently with no notification.
How to fix it
Re-add --exclusive to both places removed by this PR:
- runners/launch_mi355x-amds.sh: restore --exclusive in the salloc invocation (line ~159).
- benchmarks/multi_node/amd_utils/submit.sh: restore --exclusive in the sbatch_cmd array.
This matches every other runner in the repository (launch_h200-nv.sh, launch_b200-nv.sh, launch_h100-cw.sh, launch_h200-cw.sh, launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh, launch_b200-dgxc-slurm.sh, launch_h200-nb.sh, launch_b200-nb.sh) which all use --exclusive precisely because they also run indiscriminate docker stop cleanup steps.
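For reference, the restored single-node allocation is the pre-revert line from the diff above (the sbatch side adds the same flag to its command array):

```shell
salloc --partition=$PARTITION --gres=gpu:$TP --exclusive \
       --cpus-per-task=128 --time=180 --no-shell --job-name="$RUNNER_NAME"
```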
Reverts #929