Slurm
Frequently used Slurm commands
Table. List of common commands for Slurm job management.
| Command | Description |
|---|---|
| sbatch <script> | Submit a script file for execution |
| scancel <jobid> | Signal (kill) the job jobid for termination |
| squeue --user <username> | View information about jobs from a user |
| sinfo | View information about nodes and partitions |
| sacct | Display accounting data for running/past jobs |
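As a quick illustration (the script name job.sh and the job ID 12345 are placeholders), a typical interaction with these commands might look like:
$ sbatch job.sh                # submit the batch script; Slurm prints the assigned job ID
Submitted batch job 12345
$ squeue --user $USER          # check the state of your queued/running jobs
$ scancel 12345                # cancel the job if it is no longer needed
$ sacct --jobs=12345           # inspect accounting data once the job has finished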
Job submission commands
The main job submission commands are:
- salloc: obtain a job allocation, allowing subsequent execution of an application
- srun: obtain a job allocation and execute an application
- sbatch: submit a batch script for later execution
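For example (the node count, time limit, and script name are illustrative), the three commands could be used as follows:
$ salloc --nodes=1 --time=00:30:00     # obtain an allocation and work interactively
$ srun --ntasks=4 hostname             # obtain an allocation and launch 4 tasks of "hostname"
$ sbatch slurm.sh                      # queue the batch script slurm.sh for later execution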
Table. Command line options for job submission commands.
| Option | Description |
|---|---|
| --job-name=<name> | Job name |
| --account=<accountID> | Account to be charged for the resources used |
| --begin=<yyyy-mm-dd> | Initiate the job after the specified date/time |
| --time=<dd-hh:mm:ss> | Wall clock duration limit for the job |
| --partition=<name> | Partition/queue in which to run the job |
| --nodes=<minnodes[-maxnodes]> | Minimum and maximum number of nodes required for the job |
| --ntasks=<count> | Number of tasks to be launched |
| --ntasks-per-socket=<count> | Number of tasks to be launched per socket |
| --ntasks-per-node=<count> | Number of tasks to be launched per node |
| --mem=<MB> | Memory required per node |
| --mem-per-cpu=<MB> | Memory required per allocated CPU |
| --exclusive[=user] | Allocated nodes cannot be shared with other jobs (or, with =user, with other users' jobs) |
| --mail-user=<address> | User e-mail address for notifications |
| --mail-type=<begin,end,...> | E-mail notification type |
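A minimal batch script exercising several of these options might look as follows (the account, partition, e-mail address, and application name are placeholders to adapt to your site):
#!/bin/bash
#SBATCH --job-name=test-job
#SBATCH --account=myaccount
#SBATCH --partition=batch
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --mem=4000
#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=BEGIN,END

srun ./my_application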
Controlling queued and running jobs
To suspend a job that is currently running on the system, use scontrol with the suspend command. This stops the running job at its current step; it can be resumed at a later time:
$ scontrol suspend <job_id>
To resume a paused job, we use scontrol with the resume command:
$ scontrol resume <job_id>
Slurm also provides a utility to hold jobs that are queued in the system. Holding a job places it at the lowest priority, effectively “holding” it from being run. A job can only be held while it is still pending (waiting to be run):
$ scontrol hold <job_id>
We can then release a held job using the release command:
$ scontrol release <job_id>
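For instance, with a hypothetical pending job 12345, the hold can be verified in the squeue output before the job is released:
$ scontrol hold 12345
$ squeue --jobs=12345      # the job typically shows priority 0 and the reason "JobHeldUser"
$ scontrol release 12345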
Accounting commands
The main accounting commands are:
- sacct: display accounting data
- sacctmgr: view and modify account information
Table. Command line options for the sacct command.
| Option | Description |
|---|---|
| --user=<name> | Display accounting data for a specific user |
| --allusers | Display accounting data for all users' jobs |
| --accounts=<accountIDs> | Display jobs charged to the specified accounts |
| --endtime=<yyyy-mm-dd> | End of the reporting period |
| --format=<spec> | Output format specification |
| --starttime=<yyyy-mm-dd> | Start of the reporting period |
We can display info about jobs submitted in the last 4 days:
$ sacct --allocations --starttime=now-4days --format=jobid,Elapsed,NCPUS,State,WorkDir%70
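Similarly (the user and account names are placeholders), accounting data for one user's jobs charged to a given account over a fixed period could be queried with:
$ sacct --user=user1 --accounts=project01 --starttime=2025-01-01 --endtime=2025-01-31 \
    --format=jobid,jobname%20,elapsed,ncpus,state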
Administration commands
Print all Slurm configuration details from slurm.conf (including defaults):
# scontrol show config
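A single parameter can be extracted by filtering the output, for example:
# scontrol show config | grep -i SchedulerType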
Reconfiguration of services after changing configuration files:
# scontrol reconfigure
Reserve one node (create a reservation) for two users:
# scontrol create reservation reservationname="my-reservation" starttime=NOW \
duration=UNLIMITED flags=IGNORE_JOBS users="user1,user2" nodes=cn1
or a similar reservation starting at a specified date/time:
# scontrol create reservation reservationname="my-reservation" starttime=2025-09-24T14:00 \
duration=UNLIMITED flags=IGNORE_JOBS users="user1,user2" nodes=cn1
or for a specified number of nodes:
# scontrol create reservation ReservationName="dft-meeting" users="user1,user2" \
StartTime=2025-12-30T08:00:00 Duration=04:00:00 Flags=IGNORE_JOBS TRES=node=1
Check existing reservations:
# scontrol show reservations
Jobs intended for the reserved nodes should be submitted as:
$ sbatch --reservation="my-reservation" slurm.sh
A reservation can be deleted with:
sudo scontrol delete ReservationName="my-reservation"
Reserve the entire system for maintenance:
# scontrol create reservation starttime=NOW \
duration=UNLIMITED users=root flags=maint,ignore_jobs nodes=ALL
Manage node activity:
# scontrol update NodeName=cn1,cn2 State=DRAIN Reason="Maintenance"
# scontrol update NodeName=cn1,cn2 State=RESUME Reason="Maintenance finished"
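To verify the result (node names are illustrative), the drain state and the recorded reason can be inspected with:
# sinfo --states=drain       # list nodes currently in the DRAIN state
# sinfo -R                   # show the Reason recorded for down/drained nodes
# scontrol show node cn1     # full state of a single node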
The synopsis for the sacctmgr command is:
[root@hn]# sacctmgr <options> <command>
where notable options include:
- --immediate: commit changes immediately, without asking for confirmation
- --parsable: produce '|'-delimited (machine-parsable) output
Table. Commands and options for the sacctmgr command.
| Command | Description |
|---|---|
| reconfigure | Reconfigure the SlurmDBD |
| shutdown | Shut down the SlurmDBD server |
| create <entity> <specs> | Create an entity |
| remove <entity> where <specs> | Delete entities |
| show <entity> [<specs>] | Display information about entities |
| modify <entity> where <specs> set <specs> | Modify entities |
Table. List of entities for the sacctmgr command.
| Entity | Description |
|---|---|
| account | Bank account |
| association | Used to group information for list and show commands |
| coordinator | Usually an account manager |
| event | Events such as downed or draining nodes |
| job | Used to modify specific fields of a job |
| stats | Used with list and show commands to view statistics |
| tres | Used with list and show commands to list Trackable RESources |
| user | Login user name |
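As an example of combining these entities (the cluster, account, and user names are placeholders), a new bank account and an associated user could be created and then inspected with:
# sacctmgr --immediate create account name=project01 cluster=mycluster description="Project 01" organization=science
# sacctmgr --immediate create user name=user1 account=project01
# sacctmgr show association account=project01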
List runaway (ghost) jobs and fix them. Runaway jobs are jobs that don't exist in the controller but are still considered running or pending in the database.
# sacctmgr show runawayjobs
Manage Slurm daemons and services:
# slurmctld -Dvvvv # Run Slurm control daemon in the foreground (Head Node)
# slurmdbd -Dvvvv # Run Slurm database daemon in the foreground (Head Node)
# slurmd -Dvvvv # Run Slurm daemon in the foreground (Compute Nodes)
# systemctl status slurmctld # Check daemon status
# systemctl stop slurmctld # Stop daemon
# systemctl restart slurmctld # Restart daemon
# ssh cn1 systemctl stop slurmd # Stop slurmd daemon via SSH on a Compute Node
# journalctl -xeu slurmctld # Print control daemon activity log
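After (re)starting the daemons, a few basic checks can confirm that the controller and compute nodes are communicating (a sketch; output depends on the cluster):
# scontrol ping       # verify that the primary/backup slurmctld respond
# sinfo               # all expected partitions and nodes should be listed
# srun -N1 hostname   # run a trivial job to confirm end-to-end scheduling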