Scalable Deep Learning on the UL HPC Platform
The objective of this tutorial is to practice running Horovod (and Keras/TensorFlow) on the UL HPC iris cluster.
It's important that you read the slides first.
Horovod with TensorFlow, multi-node & multi-GPU tests
- As an initial test, you will now use the following launcher to:
  - reserve 2 GPU nodes and all their GPUs (8); edit the launcher to match this
  - start Horovod through its
#!/bin/bash -l
#SBATCH -o %x_%j.out
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH -t 1:0:0
#SBATCH -p gpu
## The following is only needed during HPC School 2019.06
module load swenv/default-env/devel
## Load TF and Horovod that link to CUDA, cuDNN & NCCL
module load lib/TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
module load tools/Horovod/0.16.3-fosscuda-2019a-Python-3.7.2
## Create tests directory and clone the TF benchmarks inside
mkdir -p "$SCRATCH/tests-horovod" && cd "$SCRATCH/tests-horovod"
git clone
cd benchmarks
git checkout cnn_tf_v1.13_compatible
## Horovod execution
horovodrun -np $SLURM_NTASKS \
python scripts/tf_cnn_benchmarks/ \
--model resnet101 --batch_size 64 --variable_update horovod
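The launcher requests 4 tasks and 4 GPUs per node and passes `$SLURM_NTASKS` to `horovodrun -np`, so the number of Horovod ranks follows directly from the number of nodes reserved. The following is a minimal sketch of that arithmetic, assuming one rank per GPU. The helper name `expected_ranks` is hypothetical, and the fallback values (2 nodes, 4 tasks per node) only mirror the exercise when the snippet runs outside a Slurm job.

```shell
#!/bin/bash
# Hypothetical helper: number of Horovod ranks expected for an allocation,
# assuming one task (rank) per GPU as in the launcher.
expected_ranks() {
    local nodes="$1" tasks_per_node="$2"
    echo $(( nodes * tasks_per_node ))
}

# Inside a job, Slurm exports these variables. The fallbacks (2 nodes,
# 4 tasks per node) are assumptions used only outside a job.
nodes="${SLURM_NNODES:-2}"
tasks_per_node="${SLURM_NTASKS_PER_NODE:-4}"
echo "Expected Horovod ranks: $(expected_ranks "$nodes" "$tasks_per_node")"
```

If the printed value differs from `$SLURM_NTASKS` inside the job, the `#SBATCH` directives probably do not match the intended allocation.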
- Now check:
  - How many images/sec did the benchmark report at the end of its execution?
  - What results do you get from a single node (4 GPUs)? And from a single GPU?
  - If you load the non-accelerated versions of TF and Horovod (the ones built on the non-CUDA toolchain rather than fosscuda), what result do you get from a regular compute node without GPUs (submit to the regular, non-GPU partition instead of gpu) when using it fully, i.e. all 28 cores?
  - How many full regular nodes do you need to replicate the benchmark result of a single accelerated node with its 4 GPUs?
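The comparisons above come down to reading one throughput figure per run and doing a little arithmetic. The sketch below is one possible way to do that, assuming the benchmark log ends with a line of the form `total images/sec: <number>`. The helper names (`images_per_sec`, `nodes_needed`, `scaling_efficiency`) are hypothetical and not part of the tutorial, and no throughput value is implied here: plug in your own measurements.

```shell
#!/bin/bash
# Extract the final throughput from a benchmark log.
# Assumption: the log contains a line "total images/sec: <number>".
images_per_sec() {
    awk -F': ' '/^total images\/sec/ { value = $2 } END { print value }' "$1"
}

# Hypothetical helper: number of CPU-only nodes needed to match one GPU node,
# given both throughputs in images/sec (rounded up to whole nodes).
nodes_needed() {
    awk -v gpu="$1" -v cpu="$2" \
        'BEGIN { n = gpu / cpu; r = int(n); if (r < n) r++; print r }'
}

# Hypothetical helper: scaling efficiency in percent, i.e. measured
# multi-GPU throughput relative to ideal linear scaling from one GPU.
scaling_efficiency() {
    awk -v multi="$1" -v single="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", 100 * multi / (single * n) }'
}
```

For example, `images_per_sec myjob_12345.out` (a hypothetical output file name following the `%x_%j.out` pattern) prints the throughput of one run, and `scaling_efficiency "$multi" "$single" 8` relates an 8-GPU result to a single-GPU one.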
Horovod with Keras and TensorFlow
For this part we will use the (excellent) SC18 Tutorial: Deep Learning At Scale.
You will need to git clone
on the Iris cluster (preferably under your $SCRATCH).
Then we will need to adapt its input configuration files under configs
and its launcher scripts.
The current (UL HPC) tutorial's repository provides customized files to be used on the Iris cluster:
├── cifar10_cnn.yaml
├── cifar10_resnet.yaml
├── hello.yaml
└── imagenet_resnet.yaml
Typically you will start by launching the
example, and will quickly discover that it runs slowly (check the appropriate output in logs/).
What will you need to adapt? What is different from the *
launchers? (Take a look at what
the code does when run with --distributed.)