Job scheduling with SLURM
Copyright (c) 2013-2021 UL HPC Team <firstname.lastname@example.org>
This page is part of the Getting started tutorial and follows the "Overview" section.
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. It is used on the Iris UL HPC cluster.
- It allocates exclusive or non-exclusive access to the resources (compute nodes) to users for a limited amount of time so that they can perform their work.
- It provides a framework for starting, executing and monitoring work
- It arbitrates contention for resources by managing a queue of pending work.
- It schedules users' jobs on the cluster resources.
A user job is characterized by:
- the number of nodes
- the number of CPU cores
- the memory requested
- the walltime
- the launcher script, which will initiate your tasks
Partition: group of compute nodes with specific usage characteristics (time limits and maximum number of nodes per job).
QOS: The quality of service associated with a job affects the way it is scheduled (priority, preemption, limits per user, etc).
Tasks: processes run in parallel inside the job
We will now see the basic commands of Slurm.
- Connect to aion-cluster or iris-cluster. You can request resources in interactive mode:
Notice that with no other parameters, srun gave you one resource for 30 minutes. You were also directly connected to the node you reserved with an interactive shell.
Now exit the reservation (exit command or CTRL-D).
When you run exit, you are disconnected and your reservation is terminated.
To avoid premature termination of your jobs in case of errors (for example, a terminal closed by mistake), you can reserve and connect in two steps using the job ID associated with your reservation.
- First run a passive job, i.e. run a predefined command (here sleep 4h, to delay the execution for 4 hours) on the first reserved node:
(access)$> sbatch --qos normal --wrap "sleep 4h"
Submitted batch job 390
Notice that you received a job ID (390 in the above example), which you can later use to connect to the reserved resource(s):
(access)$> sjoin 390 # adapt the job ID accordingly ;)
(node)$> ps aux | grep sleep
<login> 186342 0.0 0.0 107896 604 ? S 17:58 0:00 sleep 4h
<login> 187197 0.0 0.0 112656 968 pts/0 S+ 18:04 0:00 grep --color=auto sleep
(node)$> exit # or CTRL-D
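When scripting, the job ID can be extracted from the message printed by sbatch instead of being copied by hand. The following is a minimal sketch: the helper name job_id_from_sbatch is hypothetical, and it assumes the message keeps the "Submitted batch job <id>" form shown above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): read sbatch's confirmation
# message on stdin and print the job ID, assuming the
# "Submitted batch job <id>" format.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Demonstration on the message from the example above (no cluster needed).
echo "Submitted batch job 390" | job_id_from_sbatch

# On the cluster, the helper could be used as follows (not run here):
#   jobid=$(sbatch --qos normal --wrap "sleep 4h" | job_id_from_sbatch)
#   sjoin "$jobid"
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which may be simpler when the message does not need to be shown.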
Question: When will job 390 end?
a. after 10 days
b. after 2 hours
c. never, only when I delete the job
Question: use the $SLURM_* variables on the command line to extract the following information once connected to your job:
a. the list of hostnames where a core is reserved (one per line)
b. number of reserved cores
Hint: search for the NPROCS variable
c. number of reserved nodes
Hint: search for the NNODES variable
d. number of cores reserved per node together with the node name (one per line)
Example of output:
12 iris-11
12 iris-15
Hint: search for the NPROCS variable or NODELIST
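As a starting point for this exercise, the sketch below prints a few of the variables that Slurm exports inside a job. The function name show_job_env is hypothetical. Depending on how the job was submitted, some variables may be unset, and outside a job all of them are, so a placeholder is printed in that case.

```shell
#!/bin/bash
# Hypothetical helper: summarise the current job from variables that
# Slurm exports inside a reservation. Prints "unset" for anything that
# is not defined (for example when run outside a job).
show_job_env() {
    echo "job id : ${SLURM_JOB_ID:-unset}"
    echo "nodes  : ${SLURM_NNODES:-unset}"
    echo "tasks  : ${SLURM_NPROCS:-unset}"
    echo "hosts  : ${SLURM_NODELIST:-unset}"
}

show_job_env

# Inside a job, every exported variable can be listed with:
#   env | grep '^SLURM_'
```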
Job management with interactive jobs
Normally, the previously run job is still running.
- You can check the status of your running jobs using
(access)$> squeue # list all jobs
(access)$> squeue -u <login> # list all your jobs
Then you can delete your job by running
(access)$> scancel 390
- You can see the system-level utilization (memory, I/O, energy) of a running job using
(access)$> sstat 390
In all remaining examples of reservation in this section, remember to delete the reserved jobs afterwards (using
You probably want to use more than one core, and you might want them for a duration other than the default.
Reserve interactively 4 cores in one task on one node, for 30 minutes (delete the job afterwards)
(access)$> salloc -p interactive --time=0:30:0 -N 1 --ntasks-per-node=1 --cpus-per-task=4
Reserve interactively 4 tasks (system processes) on 2 nodes for 30 minutes (delete the job afterwards)
(access)$> salloc -p interactive --time=0:30:0 -N 2 --ntasks-per-node=2 --cpus-per-task=2
This command can also be written in a more compact way
(access)$> si --time=0:30:0 -N2 -n4 -c2
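To check what such a request adds up to, multiply the number of nodes by the tasks per node and by the cores per task. The sketch below does this arithmetic in the shell; total_cores is a hypothetical helper, not a Slurm command.

```shell
#!/bin/bash
# Hypothetical helper: total number of cores requested by
#   -N <nodes> --ntasks-per-node=<tasks> --cpus-per-task=<cpus>
total_cores() {
    local nodes=$1 tasks_per_node=$2 cpus_per_task=$3
    echo $(( nodes * tasks_per_node * cpus_per_task ))
}

# 1 node, 1 task, 4 cores per task (the first reservation above)
total_cores 1 1 4

# The compact form -N2 -n4 -c2, assuming the 4 tasks are spread
# evenly: 2 nodes, 2 tasks per node, 2 cores per task
total_cores 2 2 2
```

The two calls print 4 and 8 respectively.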
- You can stop a waiting job from being scheduled and later allow it to be scheduled:
(access)$> scontrol hold $SLURM_JOB_ID
(access)$> scontrol release $SLURM_JOB_ID
Passive jobs submission
In the previous section, we started interactive jobs. When srun is used on the access server, it submits an interactive job: srun waits for the job to be scheduled before starting a new shell.
The other possibility is to submit a passive job with sbatch.
In this case, resources can be specified on the command line as with srun, or in the launcher script.
The first lines of the launcher script can contain headers. For example, the following launcher script requests a single task with one core for 5 minutes in the batch partition:
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -c 1
#SBATCH --time=0-00:05:00
#SBATCH -p batch
echo "Hello World!"
Save this content in a file named launcher.sh and submit it with
(access)$> sbatch launcher.sh
Modify this example to:
- Request 6 single-core tasks equally spread across two nodes for 3 hours
- Request as many tasks as cores available on a single node in the batch queue for 3 hours
- Request one sequential task requiring half the memory of a regular node for 1 day (use the header #SBATCH --mem=64GB on Iris or
- Iris only: request one GPU task for 4 hours and dedicate 1/4 of the available cores to its management (use the gpu partition and the header #SBATCH -G 1)
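As an illustration, one possible answer to the first item (6 single-core tasks spread equally across two nodes for 3 hours) is sketched below. Only the headers of the launcher above are changed. The final message and its fallback values are assumptions, added so that the script also runs outside Slurm.

```shell
#!/bin/bash -l
# Two nodes
#SBATCH -N 2
# 3 tasks per node, i.e. 6 tasks in total
#SBATCH --ntasks-per-node=3
# One core per task
#SBATCH -c 1
# 3 hours of walltime
#SBATCH --time=0-03:00:00
#SBATCH -p batch

# SLURM_NTASKS and SLURM_NNODES are normally exported by Slurm inside
# the job; the defaults after ":-" are only used when they are unset.
msg="Hello World from ${SLURM_NTASKS:-6} tasks on ${SLURM_NNODES:-2} nodes"
echo "$msg"
```

The other items follow the same pattern: adjust the node, task, core, memory and time headers, and keep the body of the script unchanged.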