Slurm
Frequently used Slurm commands
Table. List of common commands for Slurm job management.
| Command | Description |
|---|---|
| sbatch <script> | Submit a script file for execution |
| scancel <jobid> | Signal (kill) the job jobid for termination |
| squeue --user <username> | View information about jobs from a user |
| sinfo | View information about nodes and partitions |
| sacct | Display accounting data for running/past jobs |
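As a quick illustration (the script name job.sh and the job ID 12345 are placeholders), a typical interaction with these commands might look like:
$ sbatch job.sh                # submit the batch script; Slurm prints the assigned job ID
Submitted batch job 12345
$ squeue --user $USER          # check the state of your queued/running jobs
$ scancel 12345                # cancel the job if it is no longer needed
$ sacct --jobs=12345           # inspect accounting data once the job has finished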
Job submission commands
The main job submission commands are:
- salloc: obtain a job allocation, allowing subsequent execution of an application
- srun: obtain a job allocation and execute an application
- sbatch: submit a batch script for later execution
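For example (the node count, time limit, and script name are illustrative), the three commands could be used as follows:
$ salloc --nodes=1 --time=00:30:00     # obtain an allocation and work interactively
$ srun --ntasks=4 hostname             # obtain an allocation and launch 4 tasks of "hostname"
$ sbatch slurm.sh                      # queue the batch script slurm.sh for later execution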
Table. Command line options for job submission commands.
| Option | Description |
|---|---|
| --job-name=<name> | Job name |
| --account=<accountID> | Account to be charged for the resources used |
| --begin=<yyyy-mm-dd> | Initiate the job after the specified date/time |
| --time=<dd-hh:mm:ss> | Wall clock duration limit for the job |
| --partition=<name> | Partition/queue in which to run the job |
| --nodes=<minnodes[-maxnodes]> | Minimum and maximum number of nodes required for the job |
| --ntasks=<count> | Number of tasks to be launched |
| --ntasks-per-socket=<count> | Number of tasks to be launched per socket |
| --ntasks-per-node=<count> | Number of tasks to be launched per node |
| --mem=<MB> | Memory required per node |
| --mem-per-cpu=<MB> | Memory required per allocated CPU |
| --exclusive[=user] | Allocated nodes cannot be shared with other jobs (or, with =user, with other users' jobs) |
| --mail-user=<address> | User e-mail address for notifications |
| --mail-type=<begin,end,...> | E-mail notification type |
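A minimal batch script exercising several of these options might look as follows (the account, partition, e-mail address, and application name are placeholders to adapt to your site):
#!/bin/bash
#SBATCH --job-name=test-job
#SBATCH --account=myaccount
#SBATCH --partition=batch
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --mem=4000
#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=BEGIN,END

srun ./my_application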
Controlling queued and running jobs
To suspend a job that is currently running on the system, use scontrol with the suspend command. This stops the running job at its current step; it can be resumed at a later time:
$ scontrol suspend <job_id>
To resume a paused job, we use scontrol with the resume command:
$ scontrol resume <job_id>
Slurm also provides a utility to hold jobs that are queued in the system. Holding a job places it at the lowest priority, effectively “holding” it from being run. A job can only be held while it is still pending (waiting to be run):
$ scontrol hold <job_id>
We can then release a held job using the release command:
$ scontrol release <job_id>
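For instance, with a hypothetical pending job 12345, the hold can be verified in the squeue output before the job is released:
$ scontrol hold 12345
$ squeue --jobs=12345      # the job typically shows priority 0 and the reason "JobHeldUser"
$ scontrol release 12345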
Accounting commands
The main accounting commands are:
- sacct: display accounting data
- sacctmgr: view and modify account information
Table. Command line options for the sacct command.
| Option | Description |
|---|---|
| --user=<name> | Display accounting data for a specific user |
| --allusers | Display accounting data for all users' jobs |
| --accounts=<accountIDs> | Display jobs charged to the specified accounts |
| --endtime=<yyyy-mm-dd> | End of the reporting period |
| --format=<spec> | Output format specification |
| --starttime=<yyyy-mm-dd> | Start of the reporting period |
We can display info about jobs submitted in the last 4 days:
$ sacct --allocations --starttime=now-4days --format=jobid,Elapsed,NCPUS,State,WorkDir%70
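Similarly (the user and account names are placeholders), accounting data for one user's jobs charged to a given account over a fixed period could be queried with:
$ sacct --user=user1 --accounts=project01 --starttime=2025-01-01 --endtime=2025-01-31 \
    --format=jobid,jobname%20,elapsed,ncpus,state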
Administration commands
Print all Slurm configuration details from slurm.conf (including defaults):
# scontrol show config
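A single parameter can be extracted by filtering the output, for example:
# scontrol show config | grep -i SchedulerType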
Reconfiguration of services after changing configuration files:
# scontrol reconfigure
Reserve one node (create a reservation) for two users:
# scontrol create reservation reservationname="my-reservation" starttime=NOW \
duration=UNLIMITED flags=IGNORE_JOBS users="user1,user2" nodes=cn1
or a similar reservation starting at a specified date/time:
# scontrol create reservation reservationname="my-reservation" starttime=2025-09-24T14:00 \
duration=UNLIMITED flags=IGNORE_JOBS users="user1,user2" nodes=cn1
or for a specified number of nodes:
# scontrol create reservation ReservationName="dft-meeting" users="user1,user2" \
StartTime=2025-12-30T08:00:00 Duration=04:00:00 Flags=IGNORE_JOBS TRES=node=1
Check existing reservations:
# scontrol show reservations
Jobs intended for the reserved nodes should be submitted as:
$ sbatch --reservation="my-reservation" slurm.sh
A reservation can be deleted with:
sudo scontrol delete ReservationName="my-reservation"
Reserve the entire system for maintenance:
# scontrol create reservation starttime=NOW \
duration=UNLIMITED users=root flags=maint,ignore_jobs nodes=ALL
Manage node activity:
# scontrol update NodeName=cn1,cn2 State=DRAIN Reason="Maintenance"
# scontrol update NodeName=cn1,cn2 State=RESUME Reason="Maintenance finished"
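To verify the result (node names are illustrative), the drain state and the recorded reason can be inspected with:
# sinfo --states=drain       # list nodes currently in the DRAIN state
# sinfo -R                   # show the Reason recorded for down/drained nodes
# scontrol show node cn1     # full state of a single node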
The synopsis for the sacctmgr command is:
[root@hn]# sacctmgr <options> <command>
where notable options include:
- --immediate: commit changes immediately, without asking for confirmation
- --parsable: produce '|'-delimited (machine-parsable) output
Table. Commands and options for the sacctmgr command.
| Command | Description |
|---|---|
| reconfigure | Reconfigure the SlurmDBD |
| shutdown | Shut down the SlurmDBD server |
| create <entity> <specs> | Create an entity |
| remove <entity> where <specs> | Delete entities |
| show <entity> [<specs>] | Display information about entities |
| modify <entity> where <specs> set <specs> | Modify entities |
Table. List of entities for the sacctmgr command.
| Entity | Description |
|---|---|
| account | Bank account |
| association | Used to group information for list and show commands |
| coordinator | Usually an account manager |
| event | Events such as downed or draining nodes |
| job | Used to modify specific fields of a job |
| stats | Used with list and show commands to view statistics |
| tres | Used with list and show commands to list Trackable RESources |
| user | Login user name |
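As an example of combining these entities (the cluster, account, and user names are placeholders), a new bank account and an associated user could be created and then inspected with:
# sacctmgr --immediate create account name=project01 cluster=mycluster description="Project 01" organization=science
# sacctmgr --immediate create user name=user1 account=project01
# sacctmgr show association account=project01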
List runaway (ghost) jobs and fix them. Runaway jobs are jobs that don't exist in the controller but are still considered running or pending in the database.
# sacctmgr show runawayjobs
Manage Slurm daemons and services:
# slurmctld -Dvvvv # Run Slurm control daemon in the foreground (Head Node)
# slurmdbd -Dvvvv # Run Slurm database daemon in the foreground (Head Node)
# slurmd -Dvvvv # Run Slurm daemon in the foreground (Compute Nodes)
# systemctl status slurmctld # Check daemon status
# systemctl stop slurmctld # Stop daemon
# systemctl restart slurmctld # Restart daemon
# ssh cn1 systemctl stop slurmd # Stop slurmd daemon via SSH on a Compute Node
# journalctl -xeu slurmctld # Print control daemon activity log
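After (re)starting the daemons, a few basic checks can confirm that the controller and compute nodes are communicating (a sketch; output depends on the cluster):
# scontrol ping       # verify that the primary/backup slurmctld respond
# sinfo               # all expected partitions and nodes should be listed
# srun -N1 hostname   # run a trivial job to confirm end-to-end scheduling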