Slurm

Slurm frequent commands

Table. List of common commands for Slurm job management

Command Description
sbatch <script> Submit script file for execution
scancel <jobid> Signal (kill) jobid for termination
squeue --user <username> View information about jobs from a user
sinfo View information about nodes and partitions
sacct Display accounting data for running/past jobs
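
The commands above combine into a typical workflow. A short sketch (the script name and job ID are placeholders):

```shell
# Submit a batch script; sbatch prints the assigned job ID
$ sbatch job.sh
Submitted batch job 12345

# Check the status of our queued and running jobs
$ squeue --user $USER

# Cancel the job if it is no longer needed
$ scancel 12345
```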

Job submission commands

  • salloc, obtain a job allocation. Allows for subsequent execution of an application;
  • srun, obtain a job allocation, and execute an application;
  • sbatch, submit a batch script for later execution;

Table. Command line options for job submission commands.

Option Description
--job-name=<name> Job name
--account=<accountID> Account to be charged for resources used
--begin=<yyyy-mm-dd> Initiate job after specified date/time
--time=<dd-hh:mm:ss> Wall clock duration limit for the job
--partition=<name> Partition/queue in which to run the job
--nodes=<minnodes[-maxnodes]> Minimum and maximum nodes required for the job
--ntasks=<count> Number of tasks to be launched
--ntasks-per-socket=<count> Number of tasks to be launched per socket
--ntasks-per-node=<count> Number of tasks to be launched per node
--mem=<MB> Memory required per node
--mem-per-cpu=<MB> Memory required per CPU allocated
--exclusive[=user] Allocated nodes cannot be shared with other jobs (or other users' jobs)
--mail-user=<address> User email address for notifications
--mail-type=<begin,end,...> E-mail notification type
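
The options above can also be embedded in a batch script as #SBATCH directives, which sbatch reads before executing the script. A minimal sketch (the account, partition, and application are placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=myaccount        # placeholder account ID
#SBATCH --partition=batch          # placeholder partition name
#SBATCH --time=0-01:00:00          # 1 hour wall-clock limit
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem=4000                 # MB per node
#SBATCH --mail-user=user@example.org
#SBATCH --mail-type=begin,end

# Launch the tasks of the application across the allocation
srun ./my_app
```

The script is then submitted with `sbatch <script>`; command-line options passed to sbatch override the #SBATCH directives in the script.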

Controlling queued and running jobs

To suspend a job that is currently running, use scontrol suspend. This stops the running job at its current step so that it can be resumed later:

$ scontrol suspend <job_id>

To resume a paused job, we use scontrol with the resume command:

$ scontrol resume <job_id>

Slurm also provides a way to hold jobs that are queued in the system. Holding a job sets its priority to zero, effectively preventing it from being scheduled. A job can only be held while it is still pending (waiting to run).

$ scontrol hold <job_id>

We can then release a held job using the release command:

$ scontrol release <job_id>

Accounting commands

  • sacct, display accounting data;
  • sacctmgr, view and modify account information;

Table. Command line options for the sacct command.

Option Description
--user=<name> Display accounting for a specific user
--allusers Display accounting for all users' jobs
--accounts=<accountIDs> Display jobs charged to the specified accounts
--endtime=<yyyy-mm-dd> End of reporting period
--format=<spec> Format output
--starttime=<yyyy-mm-dd> Start of reporting period

For example, to display information about jobs submitted in the last four days:

$ sacct --allocations --starttime=now-4days --format=jobid,Elapsed,NCPUS,State,WorkDir%70

Administration commands

Print all Slurm configuration details from slurm.conf (including defaults):

scontrol show config

Reconfigure the services after changing configuration files:

# scontrol reconfigure

Reserve one node (create a reservation) for two users:

# scontrol create reservation reservationname="my-reservation" starttime=NOW \
           duration=UNLIMITED flags=IGNORE_JOBS users="user1,user2" nodes=cn1

or a similar reservation starting at a specified date/time:

# scontrol create reservation reservationname="my-reservation" starttime=2025-09-24T14:00 \
           duration=UNLIMITED flags=IGNORE_JOBS users="user1,user2" nodes=cn1

or for a specified number of nodes:

# scontrol create reservation ReservationName="dft-meeting" users="user1,user2" \
           StartTime=2025-12-30T08:00:00 Duration=04:00:00 Flags=IGNORE_JOBS TRES=node=1

Check existing reservations:

# scontrol show reservations

Jobs are submitted to the reserved nodes as follows:

sbatch --reservation="my-reservation" slurm.sh

A reservation can be deleted with:

sudo scontrol delete ReservationName="my-reservation"

Reserve the entire system for maintenance:

# scontrol create reservation starttime=NOW \
      duration=UNLIMITED user=root flags=maint,ignore_jobs nodes=ALL

Manage node activity:

# scontrol update NodeName=cn1,cn2 State=DRAIN Reason="Maintenance"
# scontrol update NodeName=cn1,cn2 State=RESUME Reason="Maintenance finished"

The synopsis for the sacctmgr command is:

[root@hn]# sacctmgr <options> <command>

where notable options include:

  • --immediate, commit changes without asking for confirmation;
  • --parsable, produce output delimited by "|" (useful for scripting);

Table. Commands and options for the sacctmgr command.

Command Description
reconfigure Reconfigures the SlurmDBD
shutdown Shut down the slurmdbd server
create <entity> <specs> Create an entity
remove <entity> where <specs> Delete entities
show <entity> [<specs>] Display info about entities
modify <entity> where <specs> set <specs> Modify entity

Table. List of entities for the sacctmgr command.

Entity Description
account Bank account against which jobs are charged
association Used to group information for list and show commands
coordinator Usually an account manager
event Events like downed or draining nodes
job Used to modify specific fields of a job
stats Used with list and show commands to view statistics
tres Used with list and show commands to list Trackable RESources
user Login user name
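
Combining the commands and entities above, a sketch of creating an account and adding a user to it (the account, user, and description names are placeholders):

```shell
# Create a bank account (name and description are placeholders)
sacctmgr --immediate create account name=proj1 description="project one"

# Add an existing login user to that account
sacctmgr --immediate create user name=user1 account=proj1

# Verify the resulting association
sacctmgr show association where account=proj1
```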

List runaway (ghost) jobs and fix them. Runaway jobs are jobs that no longer exist in the controller but are still considered running or pending in the database; after listing them, sacctmgr offers to fix them:

# sacctmgr show runawayjobs

Manage Slurm daemons and services:

# slurmctld -Dvvvv               # Run Slurm control daemon in the foreground (Head Node)
# slurmdbd -Dvvvv                # Run Slurm database daemon in the foreground (Head Node)
# slurmd -Dvvvv                  # Run Slurm daemon in the foreground (Compute Nodes)
# systemctl status slurmctld     # Check daemon status
# systemctl stop slurmctld       # Stop daemon
# systemctl restart slurmctld    # Restart daemon
# ssh cn1 systemctl stop slurmd  # Stop slurmd daemon via SSH on a Compute Node
# journalctl -xeu slurmctld      # Print control daemon activity log