Job Execution (with Slurm)
Parallel job execution on the Compute Nodes of mc2 is mediated and scheduled by Slurm, an open-source cluster management and job scheduling system. Direct access to compute nodes (e.g. via SSH) is forbidden. In broad terms, users submit a script to Slurm containing a sequence of instructions, including a request for the resources (CPUs, memory, time) required to execute a particular application.
Slurm will then weigh, hold, authorize and trigger execution based on a multitude of factors, such as how much CPU time the user has been consuming and how much compute power (and respective priority) is being requested by other users.
Key tasks of Slurm include:
- Understanding what resources are available on the cluster (compute nodes, CPUs, memory, etc.);
- Queuing and allocating jobs to run on compute nodes based on the resources available and the resources specified in the job script;
- Monitoring and reporting the status of jobs.
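For example, the resources and the queue can be inspected from the Head Node with Slurm's standard query commands (a minimal illustration):

[user@hn]$ sinfo              # Resources: partitions, node states and availability
[user@hn]$ squeue -u $USER    # Job monitoring: your queued and running jobs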
Definitions within Slurm:
- BaseBoard: Physical platform sustaining the storage, memory, processing, networking and accelerator components of a compute node;
- Socket/Processor/Core: See Figure below.
- CPU: Within Slurm, a CPU is a logical processing unit, either a core or a thread. Hyperthreading is disabled in mc2, i.e., we always have one thread per core, meaning that no more than one task should be bound to each core.

Figure: Physical components within a Node: BaseBoard, Socket, Processor.
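To see how Slurm maps these components for a particular node, you can query its record from the Head Node (a sketch; cn1 is used as an example node name and the values reported depend on the cluster):

[user@hn]$ scontrol show node cn1 | grep -E -i "board|socket|core|thread"   # Boards, sockets, cores and threads as seen by Slurm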
Slurm script template (with comments)
The Slurm batch script contains options on lines prefixed with #SBATCH, placed before any executable commands in the script. Slurm stops processing further #SBATCH directives once the first non-comment, non-whitespace line is reached.
Among the Slurm directives we highlight:
| Directive | Resources requested |
|---|---|
| --nodes | Number of nodes |
| --sockets-per-node | Number of sockets per node |
| --cores-per-socket | Number of cores per socket |
| --ntasks-per-core | Number of tasks per core |
| --ntasks | Number of MPI or OMP tasks |
| --cpus-per-task | Number of cores per MPI/OMP task |
| --mem | Amount of memory in MiB |
| --mem-per-cpu | Amount of memory per CPU in MiB |
| --time | Amount of time for resource usage |
A Slurm script should take care of at least the following:
- Request a certain number of CPUs. Here a CPU stands for a processing unit; it can be a compute core or a thread (although hyperthreading is deactivated);
- Bind the CPUs into groups of MPI/OMP computational tasks. By default, Slurm will bind one CPU to each task;
- Request a certain amount of memory. By default, whole-node allocations (using the --nodes directive) will reserve all the memory available on each node. On the other hand, individual CPU requests will reserve a certain amount of memory per CPU (set by default to 2507 MiB/core);
- Request an amount of time during which the above resources will be reserved, and after which they are forcibly released. This is currently limited to 3 days;
- Set up an environment that allows for the execution of our binary.
In order to achieve the above, we allocate a certain number of cores per socket, sockets per node and nodes. The resulting number of cores is then mapped onto a group of tasks. By default, the system assigns a single task per core. Hyperthreading is disabled in mc2, so the number of tasks bound to each core (CPU) cannot be greater than 1.
The default (and maximum) amount of memory per node for compute work is about 235 GiB. Nearly 16 GiB/node are reserved for the operating system.
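These defaults can be checked against the Slurm configuration itself (a sketch; the values reported on your system are authoritative, the numbers above are indicative):

[user@hn]$ scontrol show config | grep -E -i "DefMemPer|MaxMemPer"    # Default/maximum memory per CPU or node
[user@hn]$ scontrol show partition | grep -E "DefaultTime|MaxTime"    # Wall-time limits per partition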
Below is an annotated template of a Slurm batch file named slurm.sh,
which allocates 2 nodes, with 2 sockets each, and 45 cores per socket. This
means a total of 2x2x45=180 cores. It also assigns one parallel task per core,
allowing us to run 180 MPI (or OMP) tasks:
#!/bin/bash
# Slurm directives
#SBATCH --job-name=my_job_name # Job name
#SBATCH --time=0-01:00:00 # Wall time limit (d-hh:mm:ss)
#SBATCH --ntasks-per-core=1 # Bind each MPI/OMP task to one core
#SBATCH --cores-per-socket=45 # Number of cores per socket
#SBATCH --sockets-per-node=2 # Number of sockets per node
#SBATCH --nodes=2 # How many compute nodes
#SBATCH --mail-type=END,FAIL # Send email upon termination/failure
#SBATCH --mail-user=your@email.com # Email address
# With the above, the total number of CPUs requested is:
# CPU_tot = nodes * sockets-per-node * cores-per-socket *
# ntasks-per-core * cpus-per-task = 2 * 2 * 45 * 1 * 1 = 180
# Prepare environment:
# Clear all lingering modules before loading the module of interest
module purge
module load mymodule
# Execute application
srun /path/to/a.out > std.out
# Graceful script termination
exit 0
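Once the file is ready, it is submitted with sbatch, and the job can be followed (or cancelled) from the Head Node; the job ID below is illustrative:

[user@hn]$ sbatch slurm.sh       # Submit; Slurm replies with the assigned job ID
[user@hn]$ squeue -u $USER       # Check the job state (PD = pending, R = running)
[user@hn]$ scancel 12345         # Cancel the job, if necessary, using its job ID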
Since the number of tasks per core is 1 by default, a practical and leaner version of the above script would look like:
#!/bin/bash
# Slurm directives
#SBATCH --job-name=myjob
#SBATCH --time=0-01:00:00
#SBATCH --cores-per-socket=45
#SBATCH --sockets-per-node=2
#SBATCH --nodes=2
# Prepare environment
module purge
module load mymodule
# Execute application
srun /path/to/a.out > std.out
# Graceful script termination
exit 0
We can, however, allocate the resources (cores and memory) by solely specifying the number of MPI/OMP tasks and the amount of memory required. By default, each requested task corresponds to one requested core.
For instance, the following Slurm script informs Slurm that one needs to run 16 MPI/OMP tasks. It binds one CPU (core) to each task, requests the default amount of memory per core (2507 MiB/core), and reserves these resources for 3 days.
#!/bin/bash
# Slurm directives
#SBATCH --job-name=myjob
#SBATCH --time=3-00:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1 # This is the default (not strictly necessary)
# Prepare environment
module purge
module load mymodule
# Execute application
srun /path/to/a.out > std.out
# Graceful script termination
exit 0
Should you want to use a different amount of memory, use either the --mem or the --mem-per-cpu directive to set an amount in MiB. Using --mem=0 will reserve all the memory available on the nodes. You can also use the --exclusive directive to avoid sharing node resources with other jobs (including those of other users).
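As an illustration, the directives below (a sketch; the numbers are arbitrary) request 4096 MiB per core for 8 tasks, while the commented-out lines show the alternatives of claiming all of a node's memory, or the whole node exclusively:

#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=4096     # 4096 MiB per CPU instead of the 2507 MiB default
##SBATCH --mem=0               # Alternative: reserve all the memory available on the node(s)
##SBATCH --exclusive           # Alternative: do not share the node(s) with other jobs

Note that --mem and --mem-per-cpu are mutually exclusive, so only one of them should be active in a given script.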
Parallel jobs with OpenMPI
OpenMPI supports two modes of launching parallel jobs under Slurm:
- Using OpenMPI's full-featured mpirun launcher.
- Using Slurm's direct launch capability with srun.
Although the OpenMPI team recommends using mpirun, you should, unless there is a strong reason not to, always launch your MPI executables under Slurm jobs using srun. This ensures that Slurm performs allocation, log recording, accounting and clean-up tasks in a streamlined fashion.
Running with srun
You can launch parallel jobs using Slurm's native srun command. Here is a simple (but wasteful) batch script for running on two cores from different sockets:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=1
module purge
module load foss
srun /path/to/the/executable.x
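srun also handles hybrid MPI+OpenMP jobs by combining --ntasks with --cpus-per-task. The following is a sketch (assuming the executable was built with both MPI and OpenMP support) that runs 4 MPI tasks with 12 OpenMP threads each:

#!/bin/bash
#SBATCH --ntasks=4            # 4 MPI tasks
#SBATCH --cpus-per-task=12    # 12 cores per task, used as OpenMP threads

module purge
module load foss

# Let the OpenMP runtime follow the Slurm allocation
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Pass --cpus-per-task explicitly so that srun binds 12 cores to each task
srun --cpus-per-task=${SLURM_CPUS_PER_TASK} /path/to/the/executable.x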
Running with mpirun
When mpirun is launched within a Slurm script, it will automatically utilize
the Slurm infrastructure for launching and controlling the individual MPI
processes. Hence, it is unnecessary to specify the --hostfile, --host,
or -n options to mpirun. Below is an example of a Slurm script:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=1
module purge
module load foss
mpirun /path/to/the/executable.x
Alternatively, you may allocate your resources first, launching an interactive
Slurm job (see next Section), and then use the mpirun command to execute
your parallel program.
Below we have a job with a total of two processes, using one core per node
(ppr:1:node):
[user@hn]$ srun --nodelist=cn1,cn2 --pty bash -i
[user@cn1]$ module purge
[user@cn1]$ module load foss
[user@cn1]$ mpirun --display-map --display-allocation --host cn1,cn2 --map-by ppr:1:node /path/to/the/executable.x | tee std.out
You cannot use mpirun from the Head Node without prior allocation by Slurm
(that will be blocked by the PAM setup).
Interactive Jobs
You can run an interactive job by asking Slurm to launch a bash shell.
The following commands reserve resources to run 4 MPI/OMP tasks (4 cores)
for an indefinite amount of time, and then run the make command using those 4 cores:
[user@hn]$ srun --ntasks=4 --pty bash -i
[user@cn1]$ make -j 4
Launching a bash shell like that will land you on a compute node. Note that
MODULEPATH is automatically reconfigured so that the Genoa
software stack becomes readily available.
The following command requests one node (indefinitely), and runs the
make command using up to 48 cores:
[user@hn]$ srun --nodes=1 --pty bash -i
[user@cn1]$ make -j 48
You can specify the node hostname and secure it exclusively. The following example runs a threaded job:
[user@hn]$ srun --nodelist=cn2 --exclusive --pty bash -i
[user@cn2]$ export OMP_NUM_THREADS=48
[user@cn2]$ ./prog.x
Other options, which are usually included in Slurm batch scripts, can be
added to the srun command.
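For instance, the following request (with illustrative numbers) starts an interactive shell with 8 tasks, 4096 MiB per core and a two-hour limit:

[user@hn]$ srun --ntasks=8 --mem-per-cpu=4096 --time=0-02:00:00 --pty bash -i
[user@cn1]$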
Keeping interactive jobs alive
Interactive jobs die when you disconnect from mc2, either by choice or due to
network connection problems. To keep a job alive you can use a terminal
multiplexer like tmux. You should start tmux on the login node before you
start an interactive Slurm session:
[user@hn]$ tmux
[user@hn]$ srun --nodes=1 --pty bash -i
[user@cn1]$
In case of a disconnect, simply reconnect to mc2 and attach to your tmux session again by typing:
[user@hn]$ tmux attach
[user@cn1]$
In case you have multiple sessions:
[user@hn]$ tmux list-sessions
0: 1 windows (created Mon Feb 10 15:19:54 2025) (attached)
1: 3 windows (created Mon Feb 10 15:21:50 2025) (attached)
[user@hn]$ tmux attach -t 1
In the above we attached to session 1. You can get a full list of tmux
commands and shortcuts by pressing Ctrl-B followed by ?. See also the
tmux home page on GitHub.
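Conversely, to leave a session (and the job inside it) running on purpose before logging out, detach from it instead of exiting the shell:

[user@cn1]$ # press Ctrl-B, then d, to detach; the session keeps running on the login node
[user@hn]$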