Job scheduling with SLURM
Copyright (c) 2013-2021 UL HPC Team <hpc-sysadmins@uni.lu>
This page is part of the Getting started tutorial and follows on from the "Overview" section.
The basics
Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. It is used on the Iris UL HPC cluster.
- It allocates exclusive or non-exclusive access to resources (compute nodes) to users for a limited amount of time so that they can perform their work.
- It provides a framework for starting, executing and monitoring work.
- It arbitrates contention for resources by managing a queue of pending work.
- It lets users schedule jobs on the cluster resources.
Vocabulary
- A user job is characterized by:
  - the number of nodes
  - the number of CPU cores
  - the memory requested
  - the walltime
  - the launcher script, which will initiate your tasks
- Partition: a group of compute nodes with specific usage characteristics (time limits and maximum number of nodes per job).
- QOS: the quality of service associated with a job affects the way it is scheduled (priority, preemption, limits per user, etc.).
- Tasks: processes run in parallel inside the job.
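The characteristics listed above map onto sbatch options. As a minimal sketch, the hypothetical helper build_sbatch_cmd below (not part of Slurm or of the UL HPC tools) assembles a submission command from them and only prints it, so it can be tried without a cluster. The options used (-N, -c, --mem, --time, -p, --qos) are standard sbatch options; the values in the usage line are illustrative assumptions.

```bash
#!/bin/bash
# Hypothetical helper: print (do not run) an sbatch command line built from
# the characteristics of a user job.
# Arguments: nodes, cores per task, memory, walltime, partition, QOS, launcher
build_sbatch_cmd() {
    local nodes="$1" cores="$2" mem="$3" walltime="$4"
    local partition="$5" qos="$6" launcher="$7"
    echo "sbatch -N ${nodes} -c ${cores} --mem=${mem} --time=${walltime}" \
         "-p ${partition} --qos ${qos} ${launcher}"
}

# Illustrative values only: 1 node, 4 cores, 8 GB of memory, 30 minutes.
build_sbatch_cmd 1 4 8G 0-00:30:00 batch normal launcher.sh
```

Printing the command instead of running it makes it easy to review a request before it reaches the scheduler.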
Hands on
We will now see the basic commands of Slurm.
- Connect to aion-cluster or iris-cluster. You can request resources in interactive mode:
(access)$> si
Notice that with no other parameters, Slurm gave you one resource for 30 minutes. You were also directly connected to the node you reserved with an interactive shell.
Now exit the reservation (exit command or CTRL-D).
When you run exit, you are disconnected and your reservation is terminated.
To avoid premature termination of your jobs in case of errors (e.g. a terminal closed by mistake), you can reserve and connect in two steps using the job ID associated with your reservation.
- First run a passive job, i.e. run a predefined command -- here sleep 4h to delay the execution for 4 hours -- on the first reserved node:
(access)$> sbatch --qos normal --wrap "sleep 4h"
Submitted batch job 390
You noticed that you received a job ID (in the above example: 390), which you can later use to connect to the reserved resource(s):
(access)$> sjoin 390 # adapt the job ID accordingly ;)
(node)$> ps aux | grep sleep
<login> 186342 0.0 0.0 107896 604 ? S 17:58 0:00 sleep 4h
<login> 187197 0.0 0.0 112656 968 pts/0 S+ 18:04 0:00 grep --color=auto sleep
(node)$> exit # or CTRL-D
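The job ID printed by sbatch can also be captured in a script rather than copied by hand. The sketch below assumes the confirmation message has the form shown above ("Submitted batch job 390"); extract_job_id is a hypothetical helper name. sbatch also has a --parsable option that prints the ID without the surrounding text, which may be preferable where available.

```bash
#!/bin/bash
# Hypothetical helper: extract the job ID from sbatch's confirmation message,
# e.g. "Submitted batch job 390" -> "390".
extract_job_id() {
    local message="$1"
    local id="${message##* }"        # keep the text after the last space
    case "$id" in
        ''|*[!0-9]*) return 1 ;;     # not a plain number: report failure
        *) echo "$id" ;;
    esac
}

# On the access node this could be used as (assumes sbatch and sjoin exist):
#   jobid="$(extract_job_id "$(sbatch --qos normal --wrap "sleep 4h")")"
#   sjoin "$jobid"
extract_job_id "Submitted batch job 390"
```

The helper returns a non-zero status when the last word is not a number, so a failed submission is not mistaken for a job ID.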
Question: When will job 390 end?
a. after 10 days
b. after 2 hours
c. never, only when I delete the job
Question: manipulate the $SLURM_* variables on the command line to extract the following information, once connected to your job:
a. the list of hostnames where a core is reserved (one per line)
* hint: man echo
b. number of reserved cores
* hint: search for the NPROCS variable
c. number of reserved nodes
* hint: search for the NNODES variable
d. number of cores reserved per node together with the node name (one per line)
* Example of output:
12 iris-11
12 iris-15
* hint: NPROCS variable or NODELIST
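A convenient starting point for the questions above is to list every SLURM_* variable defined in the current shell. The sketch below is only a discovery aid and does not answer the questions; list_slurm_vars is a hypothetical helper name, and the message printed outside of a job is an assumption of this example.

```bash
#!/bin/bash
# Hypothetical helper: print every SLURM_* environment variable, sorted by name.
list_slurm_vars() {
    local vars
    vars="$(env | grep '^SLURM_' | sort)"
    if [ -z "$vars" ]; then
        echo "No SLURM_* variables set (not inside a job?)"
        return 1
    fi
    echo "$vars"
}

# Inside a job this prints the variables; elsewhere it prints the notice.
list_slurm_vars || true
```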
Job management with interactive jobs
The job you submitted earlier should still be running.
- You can check the status of your running jobs using the squeue command:
(access)$> squeue # list all jobs
(access)$> squeue -u <login> # list all your jobs
Then you can delete your job by running the scancel command:
(access)$> scancel 390
- You can see the system-level utilization (memory, I/O, energy) of a running job using sstat $jobid:
(access)$> sstat 390
In all remaining examples of reservation in this section, remember to delete the reserved jobs afterwards (using scancel or CTRL-D).
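To check from a script whether a job is still queued or running, for instance before calling scancel, the output of squeue can be reduced to the job state alone. This is a minimal sketch: job_state is a hypothetical helper, and NOT_IN_QUEUE is a label chosen for this example, not a Slurm state. The squeue options used (-h, -j, -o with the %T field) are standard.

```bash
#!/bin/bash
# Hypothetical helper: print the state of a job (e.g. PENDING, RUNNING),
# or NOT_IN_QUEUE when squeue does not report it.
job_state() {
    local jobid="$1" state
    # -h: no header, -j: select the job, -o %T: print only the job state
    state="$(squeue -h -j "$jobid" -o %T 2>/dev/null)"
    echo "${state:-NOT_IN_QUEUE}"
}

# Example use on the access node (adapt the job ID):
#   job_state 390
```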
You probably want to use more than one core, and you might want them for a duration other than the default.
- Reserve interactively 4 cores in one task on one node, for 30 minutes (delete the job afterwards):
(access)$> salloc -p interactive --time=0:30:0 -N 1 --ntasks-per-node=1 --cpus-per-task=4
- Reserve interactively 4 tasks (system processes) of 2 cores each across 2 nodes, for 30 minutes (delete the job afterwards):
(access)$> salloc -p interactive --time=0:30:0 -N 2 --ntasks-per-node=2 --cpus-per-task=2
This command can also be written in a more compact way:
(access)$> si --time=0:30:0 -N2 -n4 -c2
- You can stop a waiting job from being scheduled and, later, allow it to be scheduled again:
(access)$> scontrol hold $SLURM_JOB_ID
(access)$> scontrol release $SLURM_JOB_ID
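When -N, --ntasks-per-node and --cpus-per-task are combined as in the reservations above, the number of tasks is the number of nodes times the tasks per node, and the number of cores is that figure times the cores per task. The arithmetic can be sketched as follows; the function names are hypothetical, and the sketch assumes the same number of tasks on every node.

```bash
#!/bin/bash
# Sketch: tasks and cores implied by -N, --ntasks-per-node, --cpus-per-task.
# total_tasks <nodes> <tasks per node>
total_tasks() { echo $(( $1 * $2 )); }
# total_cores <nodes> <tasks per node> <cores per task>
total_cores() { echo $(( $1 * $2 * $3 )); }

# -N 1 --ntasks-per-node=1 --cpus-per-task=4
echo "tasks: $(total_tasks 1 1), cores: $(total_cores 1 1 4)"
# -N 2 --ntasks-per-node=2 --cpus-per-task=2
echo "tasks: $(total_tasks 2 2), cores: $(total_cores 2 2 2)"
```

Running the numbers before submitting helps catch a request that asks for more cores than intended.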
Passive jobs submission
In the previous section, we started interactive jobs with si and salloc.
When srun is used on the access server, it can also be used to submit an interactive job, meaning that srun will wait for the job to be scheduled before starting a new shell.
The other possibility is to submit a passive job with sbatch.
In this case, resources can be specified on the command line, as with srun, or in the launcher script.
The first lines of the launcher script can contain headers. For example, the following launcher script requests a single task on one core for 5 minutes in the batch queue.
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -c 1
#SBATCH --time=0-00:05:00
#SBATCH -p batch
echo "Hello World!"
Save this content in a file named launcher.sh and submit it with sbatch:
(access)$> sbatch launcher.sh
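Before submitting, it can help to review which headers a launcher actually contains, since sbatch reads them from the #SBATCH lines at the top of the file. The sketch below uses plain grep; list_sbatch_directives is a hypothetical helper name, and the temporary file merely reproduces the launcher from this section so the example runs without a cluster.

```bash
#!/bin/bash
# Hypothetical helper: show the #SBATCH header lines of a launcher script.
list_sbatch_directives() {
    grep '^#SBATCH' "$1"
}

# Self-contained demo: write the launcher from this section to a temporary
# file and list its headers.
demo_launcher="$(mktemp)"
cat > "$demo_launcher" <<'EOF'
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -c 1
#SBATCH --time=0-00:05:00
#SBATCH -p batch
echo "Hello World!"
EOF
list_sbatch_directives "$demo_launcher"
rm -f "$demo_launcher"
```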
Modify this example to:
- Request 6 single-core tasks equally spread across two nodes for 3 hours.
- Request as many tasks as cores available on a single node in the batch queue for 3 hours.
- Request one sequential task requiring half the memory of a regular node for 1 day (use the header #SBATCH --mem=64GB on Iris or #SBATCH --mem=128GB).
- Iris only: request one GPU task for 4 hours and dedicate 1/4 of the available cores for its management (use the gpu partition and the header #SBATCH -G 1).
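The durations in these exercises have to be written in the format used by the --time header, as in the launcher above (0-00:05:00, i.e. days-hours:minutes:seconds). As a small sketch, the hypothetical helper hours_to_walltime converts a whole number of hours into that form; it assumes a plain non-negative integer without leading zeros.

```bash
#!/bin/bash
# Hypothetical helper: convert a whole number of hours into the
# days-hours:minutes:seconds form accepted by --time.
hours_to_walltime() {
    local hours="$1"
    printf '%d-%02d:00:00\n' $(( hours / 24 )) $(( hours % 24 ))
}

hours_to_walltime 3     # a 3-hour request
hours_to_walltime 24    # a 1-day request
```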