Scalable Deep Learning on the UL HPC Platform
Copyright (c) 2013-2019 UL HPC Team <firstname.lastname@example.org>
The objective of this tutorial is to practice running Horovod (and Keras/TensorFlow) on the UL HPC iris cluster.
It's important that you read the slides first.
Horovod with TensorFlow, multi-node & multi-GPU tests
- As an initial test, you will now use the following launcher to:
  - reserve 2 GPU nodes and all of their GPUs (8 in total); edit the launcher to match this
  - start Horovod through its `horovodrun` wrapper
```bash
#!/bin/bash -l
#SBATCH -J HorovodTFGPU
#SBATCH -o %x_%j.out
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH -t 1:0:0
#SBATCH -p gpu

## The following only needed during HPC School 2019.06
module load swenv/default-env/devel

## Load TF and Horovod that link to CUDA, cuDNN & NCCL
module load lib/TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
module load tools/Horovod/0.16.3-fosscuda-2019a-Python-3.7.2

## Create tests directory and clone the TF benchmarks inside
mkdir $SCRATCH/tests-horovod && cd $SCRATCH/tests-horovod
git clone https://github.com/tensorflow/benchmarks
cd benchmarks
git checkout cnn_tf_v1.13_compatible

## Horovod execution
horovodrun -np $SLURM_NTASKS \
    python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --model resnet101 --batch_size 64 --variable_update horovod
```
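Once the job finishes, the throughput summary can be pulled out of the Slurm output file (`tf_cnn_benchmarks` prints a `total images/sec:` line at the end of a run). A minimal sketch, where the output file and its contents are mocked-up placeholders standing in for a real `HorovodTFGPU_<jobid>.out`, not actual benchmark results:

```shell
# Sketch: extract the final "total images/sec" figure from a job's output.
# The file below is a mocked placeholder, not real iris results.
cat > HorovodTFGPU_123456.out <<'EOF'
Step  Img/sec total_loss
100 images/sec: 210.4 +/- 0.0 (jitter = 1.2) 7.861
----------------------------------------------------------------
total images/sec: 840.25
----------------------------------------------------------------
EOF

# The summary line has the form "total images/sec: <value>"; grab field 3.
THROUGHPUT=$(awk '/^total images\/sec:/ {print $3}' HorovodTFGPU_123456.out)
echo "benchmark throughput: ${THROUGHPUT} images/sec"
```

The same `awk` one-liner works on the real output file produced via the `%x_%j.out` pattern in the launcher.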
- Now check:
  - how many images/sec did the benchmark report at the end of its execution?
  - what results do you get from a single node (4 GPUs)? and from a single GPU?
  - what result do you get from a regular compute node without GPUs when using it fully, i.e. all 28 cores? Use the batch partition, and load the non-accelerated versions of TensorFlow and Horovod.
  - how many full regular nodes do you need to replicate the benchmark result of a single accelerated node with its 4 GPUs?
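To answer the last question, compare the measured throughputs directly. A minimal sketch of the arithmetic, where the two rates below are placeholders to be replaced with your own measurements, not real iris numbers:

```shell
# Placeholder measurements (images/sec) -- substitute your own results:
GPU_NODE_RATE=800   # one accelerated node, 4 GPUs
CPU_NODE_RATE=50    # one 28-core batch node

# Nodes needed = ceil(GPU_NODE_RATE / CPU_NODE_RATE), via integer arithmetic.
NODES_NEEDED=$(( (GPU_NODE_RATE + CPU_NODE_RATE - 1) / CPU_NODE_RATE ))
echo "CPU nodes needed to match one 4-GPU node: ${NODES_NEEDED}"
```

Note that matching the aggregate images/sec on paper does not guarantee matching time-to-solution: scaling across many CPU nodes adds communication overhead that a single-node GPU run avoids.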
Horovod with Keras and TensorFlow
For this part we will use the (excellent) SC18 Tutorial: Deep Learning At Scale.
You will need to `git clone https://github.com/NERSC/sc18-dl-tutorial` on the Iris cluster (preferably under your `$SCRATCH`).
Then we will need to adapt its input configuration files (under `configs`) and its launcher scripts (under `scripts`).
You will find under the current (UL HPC) tutorial's repository customized files to be used for the Iris cluster:
```
configs/
├── cifar10_cnn.yaml
├── cifar10_resnet.yaml
├── hello.yaml
└── imagenet_resnet.yaml
scripts/
├── cifar_cnn.sh
├── cifar_resnet.sh
├── imagenet_resnet.sh
└── setup.sh
```
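These customized files are meant to replace the stock ones in the SC18 tutorial clone. A hypothetical sketch of wiring them in, where the two directory names are placeholders standing in for wherever you cloned each repository:

```shell
# Placeholder layout: "ulhpc-tutorial" stands in for this tutorial's clone,
# "sc18-dl-tutorial" for the NERSC clone under $SCRATCH.
mkdir -p ulhpc-tutorial/configs sc18-dl-tutorial/configs
echo "batch_size: 32" > ulhpc-tutorial/configs/cifar10_cnn.yaml

# Overwrite the tutorial's stock configs with the Iris-specific ones:
cp ulhpc-tutorial/configs/*.yaml sc18-dl-tutorial/configs/
ls sc18-dl-tutorial/configs/
```

Do the same for the `scripts` directory, then submit the launchers with `sbatch` from inside the SC18 tutorial clone.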
Typically you will start by launching the `cifar_cnn.sh` example, and will quickly discover that it runs slowly (check the corresponding job output file).
What will you need to adapt? What is different in the `*_resnet.sh` launchers? (Take a look at what `train.py` does.)