Scalable Deep Learning on the UL HPC Platform
Copyright (c) 2013-2019 UL HPC Team <hpc-sysadmins@uni.lu>
The objective of this tutorial is to practice running Horovod (and Keras/TensorFlow) on the UL HPC iris cluster.
It's important that you read the slides first.
Horovod with TensorFlow, multi-node & multi-GPU tests
- As an initial test, you will now use the following launcher to:
  - reserve 2 GPU nodes and all their GPUs (8 in total) - edit the launcher to match this
  - start Horovod through its `horovodrun` wrapper
```bash
#!/bin/bash -l
#SBATCH -J HorovodTFGPU
#SBATCH -o %x_%j.out
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH -t 1:0:0
#SBATCH -p gpu

## The following is only needed during HPC School 2019.06
module load swenv/default-env/devel

## Load TF and Horovod builds that link to CUDA, cuDNN & NCCL
module load lib/TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
module load tools/Horovod/0.16.3-fosscuda-2019a-Python-3.7.2

## Create the tests directory (-p: do not fail if it already exists) and clone the TF benchmarks inside
mkdir -p $SCRATCH/tests-horovod && cd $SCRATCH/tests-horovod
git clone https://github.com/tensorflow/benchmarks
cd benchmarks
git checkout cnn_tf_v1.13_compatible

## Horovod execution
horovodrun -np $SLURM_NTASKS \
    python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --model resnet101 --batch_size 64 --variable_update horovod
```
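Save the launcher to a file and submit it with `sbatch`; the filename below is just an assumption (use whatever you saved it as), and the job output goes to `HorovodTFGPU_<jobid>.out` as configured by the `-J`/`-o` directives above:

```bash
# Submit the launcher (the filename is an assumption - use the name you chose)
sbatch horovod_tf_gpu.sh
# Check the queue, then follow the job output once it starts
squeue -u $USER
tail -f HorovodTFGPU_<jobid>.out
```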
- Now check:
  - how many images/sec did the benchmark show at the end of its execution?
  - what results do you get from a single node (4 GPUs)? and from a single GPU?
  - if you load the non-accelerated versions of TF and Horovod (the ones built with the `foss` instead of the `fosscuda` toolchain), what result do you get from a regular compute node without GPUs (use the `batch` partition) when using it fully, i.e. all 28 cores? A possible CPU-only adaptation of the launcher is sketched after this list.
  - how many full regular nodes do you need to replicate the benchmark result of a single accelerated node with its 4 GPUs?
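As a starting point for the CPU-only runs, here is a possible adaptation of the launcher above. The `foss`-toolchain module versions below are assumptions - check `module avail TensorFlow Horovod` for the exact names available on iris - and running one Horovod rank per core is only one way to use the 28 cores (fewer ranks with more intra-op threads is another).

```bash
#!/bin/bash -l
#SBATCH -J HorovodTFCPU
#SBATCH -o %x_%j.out
#SBATCH -N 1
#SBATCH --ntasks-per-node=28   # one Horovod rank per core of a regular iris node
#SBATCH -t 1:0:0
#SBATCH -p batch

## Only needed during HPC School 2019.06
module load swenv/default-env/devel

## Load the non-accelerated (foss toolchain) builds - the exact versions are assumptions
module load lib/TensorFlow/1.13.1-foss-2019a-Python-3.7.2
module load tools/Horovod/0.16.3-foss-2019a-Python-3.7.2

## Reuse the benchmarks cloned by the previous launcher
cd $SCRATCH/tests-horovod/benchmarks

horovodrun -np $SLURM_NTASKS \
    python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --model resnet101 --batch_size 64 --variable_update horovod
```

For the single-node (4 GPUs) and single-GPU measurements, simply lower `-N`, `--ntasks-per-node` and `--gres=gpu:` accordingly in the original launcher.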
Horovod with Keras and TensorFlow
For this part we will use the (excellent) SC18 Tutorial: Deep Learning At Scale.
You will need to `git clone https://github.com/NERSC/sc18-dl-tutorial` on the Iris cluster (preferably under your `$SCRATCH`).
Then we will need to adapt its input configuration files under `configs` and the launcher scripts.
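A minimal sketch of the clone step (run from an iris login node; cloning directly under `$SCRATCH` is the only assumption here):

```bash
# Clone the SC18 tutorial under your scratch space
cd $SCRATCH
git clone https://github.com/NERSC/sc18-dl-tutorial
cd sc18-dl-tutorial
```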
You will find under the current (UL HPC) tutorial's repository customized files to be used for the Iris cluster:
```
configs/
├── cifar10_cnn.yaml
├── cifar10_resnet.yaml
├── hello.yaml
└── imagenet_resnet.yaml
scripts/
├── cifar_cnn.sh
├── cifar_resnet.sh
├── imagenet_resnet.sh
└── setup.sh
```
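One way to put these Iris-specific files in place is to copy them over the cloned SC18 repository; the path to your checkout of this (UL HPC) tutorial below is an assumption, adjust it to your own layout:

```bash
# Overwrite the SC18 tutorial's configs and launchers with the Iris-adapted versions
# (TUTO is an assumption: point it to your checkout of the UL HPC tutorial)
TUTO=$SCRATCH/ulhpc-tutorials/deep_learning
cp $TUTO/configs/*.yaml $SCRATCH/sc18-dl-tutorial/configs/
cp $TUTO/scripts/*.sh   $SCRATCH/sc18-dl-tutorial/scripts/
```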
Typically you will start by launching the `cifar_cnn.sh` example (a submission sketch follows below), and will quickly discover it's running slow (check the appropriate output in `logs/`).
What will you need to adapt? What's different from the `*_resnet.sh` launchers? (Take a look at what `train.py` does in `--distributed` mode.)