Scalable Deep Learning on the UL HPC Platform
Copyright (c) 2013-2019 UL HPC Team <hpc-sysadmins@uni.lu>
The objective of this tutorial is to practice running Horovod (and Keras/TensorFlow) on the UL HPC iris cluster.
It's important that you read the slides first.
Horovod with TensorFlow, multi-node & multi-GPU tests
- As an initial test, you will now use the following launcher to:
  - reserve 2 GPU nodes and all their GPUs (8 in total) - edit the launcher to match this
  - start Horovod through its `horovodrun` wrapper
```bash
#!/bin/bash -l
#SBATCH -J HorovodTFGPU
#SBATCH -o %x_%j.out
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH -t 1:0:0
#SBATCH -p gpu

## The following is only needed during HPC School 2019.06
module load swenv/default-env/devel

## Load TF and Horovod builds that link to CUDA, cuDNN & NCCL
module load lib/TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
module load tools/Horovod/0.16.3-fosscuda-2019a-Python-3.7.2

## Create the tests directory (-p: do not fail if it already exists) and clone the TF benchmarks inside
mkdir -p $SCRATCH/tests-horovod && cd $SCRATCH/tests-horovod
git clone https://github.com/tensorflow/benchmarks
cd benchmarks
git checkout cnn_tf_v1.13_compatible

## Horovod execution
horovodrun -np $SLURM_NTASKS \
    python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --model resnet101 --batch_size 64 --variable_update horovod
```
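Save the launcher to a file and submit it with `sbatch`; the filename below is just an assumption (use whatever you saved it as), and the job output goes to `HorovodTFGPU_<jobid>.out` as configured by the `-J`/`-o` directives above:

```bash
# Submit the launcher (the filename is an assumption - use the name you chose)
sbatch horovod_tf_gpu.sh
# Check the queue, then follow the job output once it starts
squeue -u $USER
tail -f HorovodTFGPU_<jobid>.out
```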
- Now check:
  - how many images/sec did the benchmark show at the end of its execution?
  - what results do you get from a single node (4 GPUs)? and from a single GPU?
  - if you load the non-accelerated versions of TF and Horovod (the ones built with the `foss` instead of the `fosscuda` toolchain), what result do you get from a regular compute node without GPUs (use the `batch` partition) when using it fully, i.e. all 28 cores? A possible CPU-only adaptation of the launcher is sketched after this list.
  - how many full regular nodes do you need to replicate the benchmark result of a single accelerated node with its 4 GPUs?
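As a starting point for the CPU-only runs, here is a possible adaptation of the launcher above. The `foss`-toolchain module versions below are assumptions - check `module avail TensorFlow Horovod` for the exact names available on iris - and running one Horovod rank per core is only one way to use the 28 cores (fewer ranks with more intra-op threads is another).

```bash
#!/bin/bash -l
#SBATCH -J HorovodTFCPU
#SBATCH -o %x_%j.out
#SBATCH -N 1
#SBATCH --ntasks-per-node=28   # one Horovod rank per core of a regular iris node
#SBATCH -t 1:0:0
#SBATCH -p batch

## Only needed during HPC School 2019.06
module load swenv/default-env/devel

## Load the non-accelerated (foss toolchain) builds - the exact versions are assumptions
module load lib/TensorFlow/1.13.1-foss-2019a-Python-3.7.2
module load tools/Horovod/0.16.3-foss-2019a-Python-3.7.2

## Reuse the benchmarks cloned by the previous launcher
cd $SCRATCH/tests-horovod/benchmarks

horovodrun -np $SLURM_NTASKS \
    python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --model resnet101 --batch_size 64 --variable_update horovod
```

For the single-node (4 GPUs) and single-GPU measurements, simply lower `-N`, `--ntasks-per-node` and `--gres=gpu:` accordingly in the original launcher.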
Horovod with Keras and TensorFlow
For this part we will use the (excellent) SC18 Tutorial: Deep Learning At Scale.
You will need to `git clone https://github.com/NERSC/sc18-dl-tutorial` on the Iris cluster (preferably under your `$SCRATCH`).
Then we will need to adapt its input configuration files under `configs` and the launcher scripts.
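A minimal sketch of the clone step (run from an iris login node; cloning directly under `$SCRATCH` is the only assumption here):

```bash
# Clone the SC18 tutorial under your scratch space
cd $SCRATCH
git clone https://github.com/NERSC/sc18-dl-tutorial
cd sc18-dl-tutorial
```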
You will find under the current (UL HPC) tutorial's repository customized files to be used for the Iris cluster:
```
configs/
├── cifar10_cnn.yaml
├── cifar10_resnet.yaml
├── hello.yaml
└── imagenet_resnet.yaml
scripts/
├── cifar_cnn.sh
├── cifar_resnet.sh
├── imagenet_resnet.sh
└── setup.sh
```
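One way to put these Iris-specific files in place is to copy them over the cloned SC18 repository; the path to your checkout of this (UL HPC) tutorial below is an assumption, adjust it to your own layout:

```bash
# Overwrite the SC18 tutorial's configs and launchers with the Iris-adapted versions
# (TUTO is an assumption: point it to your checkout of the UL HPC tutorial)
TUTO=$SCRATCH/ulhpc-tutorials/deep_learning
cp $TUTO/configs/*.yaml $SCRATCH/sc18-dl-tutorial/configs/
cp $TUTO/scripts/*.sh   $SCRATCH/sc18-dl-tutorial/scripts/
```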
Typically you will start by launching the `cifar_cnn.sh` example (a submission sketch follows below), and will quickly discover it's running slow (check the appropriate output in `logs/`).
What will you need to adapt? What's different from the `*_resnet.sh` launchers? (Take a look at what `train.py` does in `--distributed` mode.)