Job Execution (with Slurm)

Parallel job execution on the Compute Nodes of mc2 is mediated and scheduled by Slurm, an open-source cluster management and job scheduling system. Direct access to compute nodes (e.g. via SSH) is forbidden. In broad terms, users submit a script to Slurm containing a sequence of instructions, including a request for the resources (CPUs, memory, time) required to execute a particular application.

Slurm will then weigh, hold, authorize and trigger execution based on a multitude of factors, such as how much CPU time the user has been consuming and how much compute power (and respective priority) is being requested by other users, among many others.

Key tasks of Slurm include:

  1. Understanding what resources are available on the cluster (compute nodes, CPUs, memory, etc.);

  2. Queuing and allocating jobs to run on compute nodes based on the resources available and the resources specified in the job script;

  3. Monitoring and reporting the status of jobs.
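
In practice, these tasks are exposed through a handful of Slurm commands that you run on the head node. For example (the job ID 12345 is a placeholder, and the output depends on the cluster state):

[user@hn]$ sinfo                  # Show partitions and the state of the compute nodes
[user@hn]$ squeue                 # List queued and running jobs
[user@hn]$ squeue --me            # List only your own jobs
[user@hn]$ sbatch slurm.sh        # Submit a batch script (prints the assigned job ID)
[user@hn]$ scancel 12345          # Cancel job 12345
[user@hn]$ sacct -j 12345         # Show accounting/status information for job 12345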

Definitions within Slurm:

  • BaseBoard: Physical platform sustaining the storage, memory, processing, networking and accelerating components of a compute node;

  • Socket/Processor/Core: See Figure below.

  • CPU: Within Slurm, a CPU is a logical processing unit, either a core or a thread. Hyperthreading is disabled in mc2, i.e., we always have one thread per core, meaning that no more than one task should be bound to each core.

Figure: Physical components within a Node: BaseBoard, Socket, Processor, Core.

Slurm script template (with comments)

The Slurm batch script contains options prefixed with #SBATCH, placed before any executable commands in the script. Slurm stops processing #SBATCH directives once the first non-comment, non-whitespace line has been reached.
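
As a minimal illustration of this ordering rule (the directive values here are arbitrary), the second #SBATCH line below would be silently ignored because it appears after the first command:

#!/bin/bash
#SBATCH --job-name=ok          # Processed: appears before any command
echo "Preparing the run"       # First executable command
#SBATCH --time=0-01:00:00      # Ignored: appears after the first command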

Among the Slurm directives we highlight:

Directive            Resources requested
--nodes              Number of nodes
--sockets-per-node   Number of sockets per node
--cores-per-socket   Number of cores per socket
--ntasks-per-core    Number of tasks per core
--ntasks             Number of MPI or OMP tasks
--cpus-per-task      Number of cores per MPI/OMP task
--distribution       How tasks are distributed over nodes:sockets
--mem                Amount of memory per node, in MiB
--mem-per-cpu        Amount of memory per CPU, in MiB
--time               Amount of time for which the resources are reserved

A Slurm script should take care of at least the following:

  • Request a certain number of CPUs. Here a CPU stands for a processing unit: it can be a compute core or a thread (although hyperthreading is disabled);

  • Bind the CPUs into groups of MPI/OMP computational tasks. By default, Slurm will bind one CPU to each task;

  • Request a certain amount of memory. By default, whole-node allocations (using the --nodes directive) will reserve all the memory available on each node. On the other hand, individual CPU requests will reserve a certain amount of memory per CPU (set by default to 2507 MiB/core);

  • Request an amount of time during which the above resources will be reserved; they are forcibly released when that time elapses. This is currently limited to 3 days.

  • Set up an environment that allows for the execution of our binary.

In order to achieve the above, we allocate a certain number of cores per socket, sockets per node and nodes. The resulting number of cores is then mapped onto a group of tasks. By default, the system assigns a single task per core. Hyperthreading is disabled in mc2, so the number of tasks bound to each core (CPU) cannot be greater than 1.

The default (and maximum) amount of memory per node for compute work is about 235 GiB. Nearly 16 GiB/node are reserved for the operating system.

Below is an annotated template of a Slurm batch file named slurm.sh, which allocates 2 nodes, with 2 sockets each, and 45 cores per socket. This means a total of 2x2x45=180 cores. It also assigns one parallel task per core, allowing us to run 180 MPI (or OMP) tasks:

#!/bin/bash

# Slurm directives
#SBATCH --job-name=my_job_name          # Job name
#SBATCH --time=0-01:00:00               # Wall time limit (d-hh:mm:ss)
#SBATCH --ntasks-per-core=1             # Bind each MPI/OMP task to one core
#SBATCH --cores-per-socket=45           # Number of cores per socket
#SBATCH --sockets-per-node=2            # Number of sockets per node
#SBATCH --nodes=2                       # How many compute nodes
#SBATCH --distribution=block:block      # Block-wise distribution of tasks
#SBATCH --mail-type=END,FAIL            # Send email upon termination/failure
#SBATCH --mail-user=your@email.com      # Email address

# With the above, the total number of CPUs requested is:
# CPU_tot = nodes * sockets-per-node * cores-per-socket *
#           ntasks-per-core * cpus-per-task = 2 * 2 * 45 * 1 * 1 = 180

# Prepare environment:
# Clear all lingering modules before loading the module of interest
module purge
module load mymodule

# Execute application
srun /path/to/a.out > std.out

# Graceful script termination
exit 0

Since the number of tasks per core is 1 by default, a practical and leaner version of the above script would look like:

#!/bin/bash

# Slurm directives
#SBATCH --job-name=myjob
#SBATCH --time=0-01:00:00
#SBATCH --ntasks-per-core=1
#SBATCH --cores-per-socket=45
#SBATCH --sockets-per-node=2
#SBATCH --nodes=2
#SBATCH --distribution=block:block

# Prepare environment
module purge
module load mymodule

# Execute application
srun /path/to/a.out > std.out

# Graceful script termination
exit 0

We can, however, allocate the resources (cores and memory) by specifying only the number of MPI/OMP tasks and the memory required. By default, each task corresponds to one requested core.

For instance, the following Slurm script informs Slurm that one needs to run 16 MPI/OMP tasks. It binds one CPU (core) to each task, requests the default amount of memory per core (2507 MiB/core), and reserves all of that for 3 days.

#!/bin/bash

# Slurm directives
#SBATCH --job-name=myjob
#SBATCH --time=3-00:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1   # This is the default (not strictly necessary)

# Prepare environment
module purge
module load mymodule

# Execute application
srun /path/to/a.out > std.out

# Graceful script termination
exit 0

Should you want a different amount of memory, use either the --mem or the --mem-per-cpu directive to set an amount in MiB units. Using --mem=0 will reserve all the memory available on the nodes. You can also use the --exclusive directive to avoid sharing node resources with other jobs (including those of other users).
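
As an illustrative sketch (the memory figure and module name below are arbitrary, not recommendations), a 16-task job requesting 4000 MiB per CPU could look like this; the doubly-commented lines show the --mem=0 and --exclusive alternatives:

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --time=0-01:00:00
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=4000    # 4000 MiB per CPU instead of the default 2507 MiB
##SBATCH --mem=0              # Alternative: reserve all memory available on the nodes
##SBATCH --exclusive          # Alternative: do not share the nodes with other jobs

# Prepare environment
module purge
module load mymodule

# Execute application
srun /path/to/a.out > std.out

# Graceful script termination
exit 0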

Parallel jobs with OpenMPI

OpenMPI supports two modes of launching parallel jobs under Slurm:

  • Using OpenMPI's full-featured mpirun launcher.
  • Using Slurm's direct launch capability with srun.

Although the OpenMPI team recommends using mpirun, unless there is a strong reason to do otherwise you should always launch your MPI executables under Slurm jobs using srun. This ensures that Slurm performs allocation, log recording, accounting and clean-up tasks in a streamlined fashion.

Running with srun

You can launch parallel jobs using Slurm's native srun command. Here is a simple (but wasteful) batch script for running on two cores from different sockets:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=1 
#SBATCH --distribution=block:block
module purge
module load foss
srun /path/to/the/executable.x

Running with mpirun

When mpirun is launched within a Slurm script, it will automatically utilize the Slurm infrastructure for launching and controlling the individual MPI processes. Hence, it is unnecessary to specify the --hostfile, --host, or -n options to mpirun. Below is an example of a Slurm script:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=1 
#SBATCH --distribution=block:block
module purge
module load foss
mpirun /path/to/the/executable.x

Alternatively, you may allocate your resources first, launching an interactive Slurm job (see next Section), and then use the mpirun command to execute your parallel program. Below we have a job with a total of two processes, using one core per node (ppr:1:node):

[user@hn]$ srun --nodelist=cn1,cn2 --pty bash -i
[user@cn1]$ module purge
[user@cn1]$ module load foss
[user@cn1]$ mpirun --display-map --display-allocation --host cn1,cn2 --map-by ppr:1:node /path/to/the/executable.x | tee std.out

You cannot use mpirun from the Head Node without a prior allocation by Slurm (such attempts are blocked by the PAM setup).

Interactive Jobs

You can run an interactive job by asking Slurm to launch a bash shell.

The following command will reserve resources to run 4 MPI/OMP tasks (4 cores) for an indefinite amount of time, and then runs the make command using those 4 cores:

[user@hn]$ srun --ntasks=4 --pty bash -i
[user@cn1]$ make -j 4

Launching a bash shell like that will land you on a compute node. Note that MODULEPATH is automatically reconfigured so that the Genoa software stack becomes readily available.
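
As a quick sanity check (the exact module list depends on the installed software stack), you can inspect the environment once you land on the compute node:

[user@cn1]$ echo $MODULEPATH     # Should now point to the Genoa software stack
[user@cn1]$ module avail         # List the modules available on the compute node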

The following command requests one node (indefinitely), and runs the make command using up to 48 cores:

[user@hn]$ srun --nodes=1 --pty bash -i
[user@cn1]$ make -j 48

You can specify the node hostname and secure it exclusively. In the following example I am running a threaded job:

[user@hn]$ srun --nodelist=cn2 --exclusive --pty bash -i
[user@cn2]$ export OMP_NUM_THREADS=48
[user@cn2]$ ./prog.x

Other options, which are usually included in Slurm batch scripts, can be added to the srun command.
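
For instance (the values below are arbitrary), you can combine the interactive request with explicit time and memory limits:

[user@hn]$ srun --ntasks=4 --time=0-02:00:00 --mem-per-cpu=4000 --pty bash -i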

Keeping interactive jobs alive

Interactive jobs die when you disconnect from mc2, whether by choice or because of network connection problems. To keep a job alive you can use a terminal multiplexer like tmux. You should start tmux on the login node before you start an interactive Slurm session:

[user@hn]$ tmux
[user@hn]$ srun --nodes=1 --pty bash -i
[user@cn1]$

In case of a disconnect, simply reconnect to mc2 and attach to your tmux session again by typing:

[user@hn]$ tmux attach
[user@cn1]$

In case you have multiple sessions:

[user@hn]$ tmux list-sessions
0: 1 windows (created Mon Feb 10 15:19:54 2025) (attached)
1: 3 windows (created Mon Feb 10 15:21:50 2025) (attached)
[user@hn]$ tmux attach -t 1

In the above I attached to session 1. You can get a full list of tmux commands and shortcuts by pressing Ctrl-B followed by ?. See also the tmux home page on GitHub.

Creating and using scratch space

Compute nodes possess local scratch volumes for fast I/O operations. Scratch spaces (directories) can be created for the user within a Slurm job, using the directive --constraint=scratch. This turns on the scratch feature on the node that is running the job.

Note 1: Scratch spaces are local, i.e., they are only visible to their respective nodes. There is no parallel filesystem such as Lustre or BeeGFS on mc2. As a workaround, always use your /home space (networked filesystem) for working with files that need to be visible to more than a single node.

Note 2: Scratch spaces are only available within the environment of the Slurm batch script and during its execution; they are not directly accessible from the login session.

Hint: Interactive/direct access to a specific scratch folder is possible by launching an interactive job on the relevant compute node and, from there, browsing /scratch/$SLURM_JOB_ID. Scratch files older than 15 days are automatically deleted by the operating system.
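
For instance, assuming a batch job with ID 12345 is running on node cn1 (both values are placeholders for your actual job):

[user@hn]$ srun --nodelist=cn1 --pty bash -i
[user@cn1]$ ls -l /scratch/12345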

Let us consider the following script as a reference. It offers a solution to several use cases, and you may adapt it to your needs:

#!/bin/bash

# Slurm directives
#SBATCH --job-name=myjob
#SBATCH --time=0-01:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --constraint=scratch  # Create /scratch/$SLURM_JOB_ID directory

# Prepare environment
module purge
module load mymodule

# Define a variable pointing to the scratch space.
# The Slurm variable SLURM_JOB_ID stores the running job ID number.
SCRATCH_DIR=/scratch/$SLURM_JOB_ID

# Copy a scratch file to the scratch space.
# The Slurm variable SLURM_SUBMIT_DIR points to the directory where
# the Slurm script file is located (this is the job working directory).
rsync -aAXH $SLURM_SUBMIT_DIR/scratch.file $SCRATCH_DIR

# Configure an input.file so it becomes aware of the scratch space.
# This essentially replaces any occurrence of the sequence "%SCRATCH_DIR%"
# within "input.file" by the full path to the scratch space.
sed --in-place "s|%SCRATCH_DIR%|$SCRATCH_DIR|g" input.file

# Run the executable
# This reads an "input.file" located in the working directory, and redirects
# the standard output to an "output.file" on the scratch space.
srun /path/to/my/exec < input.file > $SCRATCH_DIR/output.file

# Copy all (new/changed) contents from the scratch space to the job working
# directory. Do not forget the "trailing slash" after SCRATCH_DIR.
rsync -aAXH $SCRATCH_DIR/ $SLURM_SUBMIT_DIR

# Graceful script termination
exit 0

The --constraint=scratch directive creates a directory /scratch/$SLURM_JOB_ID on the execution node, with $SLURM_JOB_ID being replaced by the serial number of the job.

Before executing the application, we store the full path of the scratch space in the variable SCRATCH_DIR, and refer to it afterwards as $SCRATCH_DIR.

We copy files between the job working directory (where the Slurm batch script is located) and the scratch space using the rsync command (the cp command could be used instead). We can also make input files aware of the scratch path. For that, we insert a placeholder within the file (for instance %SCRATCH_DIR%) and replace it with the desired path using the sed command.
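
As a minimal illustration (the contents of input.file and the job ID 12345 are hypothetical), the placeholder substitution works as follows:

[user@cn1]$ cat input.file
output_dir = %SCRATCH_DIR%
[user@cn1]$ echo $SCRATCH_DIR
/scratch/12345
[user@cn1]$ sed --in-place "s|%SCRATCH_DIR%|$SCRATCH_DIR|g" input.file
[user@cn1]$ cat input.file
output_dir = /scratch/12345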