Copyright (c) 2013-2023 P. Pochelu and UL HPC Team <hpc-sysadmins@uni.lu>
Horovod
Horovod (website) is a framework for scaling training based on the Ring All-Reduce protocol. Unlike approaches that use a centralized parameter store for aggregating and broadcasting weight updates during stochastic gradient descent (SGD), Horovod takes advantage of the computing machine's communication links, such as NVLink, to maximize performance. Horovod integrates with popular modern deep learning frameworks such as Keras2, TensorFlow2 and PyTorch2, and requires only a few code changes, making it easy to incorporate into existing workflows.
By using Horovod, you can attempt to reduce the distributed training time T compared to the time G needed on a single accelerator. However, communication overhead limits scalability; the larger the batch size, the lower the relative communication cost.
A theoretical estimate for T is T = G/W + C(W), where W is the number of workers and C(W) the communication time between workers.
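As a rough illustration of this model (not part of the tutorial; G and C(W) below are made-up values), you can tabulate the estimated time and speedup for a few worker counts:
# Hypothetical illustration of the scaling model T(W) = G/W + C(W).
# The numbers below are assumptions chosen only for this example.
def training_time(G, W, comm):
    """Estimated distributed training time for W workers."""
    return G / W + comm(W)

G = 600.0                        # assumed single-GPU training time (seconds)
comm = lambda W: 5.0 * (W - 1)   # assumed communication cost model (seconds)

for W in (1, 2, 4, 8):
    T = training_time(G, W, comm)
    print(f"W={W}: T={T:.0f} s, speedup={G / T:.2f}x")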
Prerequisites
It is assumed you already use a modern deep learning framework such as TensorFlow2 or PyTorch2.
Installation
The ULHPC team proposes two options: load a pre-installed environment, or install Horovod yourself.
Pre-installed approach:
source /work/projects/ulhpc-tutorials/PS10-Horovod/env.sh
We strongly recommend using the provided environment. However, if you want to install Horovod yourself, see Appendix 1 for help.
The following command ensures that you can now use Horovod:
horovodrun --check-build
The command should produce the following output:
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
Ensure that the deep learning framework you wish to use is checked, along with MPI and NCCL.
Typical Horovod code
The proposed codes contain the following building blocks:
- Initializing Horovod. The Horovod object is generally named "hvd"; it provides collective communication primitives and callbacks (see the sketch after this list)
- Adapting the local_batch size according to the desired global_batch_size and the number of workers
- Pinning AI accelerators to workers in a bijective (one-to-one) way
- Creating data generators that transfer data samples asynchronously and efficiently Disk -> RAM -> VRAM. The transfer from disk (I/O) to RAM is a "shard"; from RAM to VRAM, it is the "local batch"
- Building the neural network in each GPU VRAM
- Training it
- Evaluating it (time & accuracy)
Bonus: you can add extra features to your code (e.g., Horovod callbacks), but they come with a speed overhead. Examples: verbosity, monitoring the validation metric, checkpointing after each epoch, learning rate scheduling with loss-plateau detection, ...
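A minimal sketch of these blocks with TensorFlow2/Keras (this is not the exact tutorial file; the dataset, model, and hyperparameters are placeholder choices, and exact calls may vary slightly across TensorFlow/Horovod versions):
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# 1) Initialize Horovod; "hvd" exposes rank/size and collective primitives.
hvd.init()

# 2) Pin one GPU per worker (bijective mapping through the local rank).
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# 3) Derive the local batch size from the desired global batch size.
global_batch_size = 256                               # assumed value
local_batch_size = global_batch_size // hvd.size()

# 4) Data pipeline: each worker reads its own shard and feeds local batches.
(x, y), _ = tf.keras.datasets.cifar10.load_data()
dataset = (tf.data.Dataset.from_tensor_slices(((x / 255.0).astype("float32"), y))
           .shard(hvd.size(), hvd.rank())
           .shuffle(10_000)
           .batch(local_batch_size)
           .prefetch(tf.data.AUTOTUNE))

# 5) Build the model in each GPU's memory and wrap the optimizer so that
#    gradients are averaged across workers.
model = tf.keras.applications.ResNet50(weights=None, input_shape=(32, 32, 3), classes=10)
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# 6) Train; broadcast the initial weights so all workers start identically,
#    and keep verbosity on rank 0 only.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(dataset, epochs=4, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

# 7) Evaluation (time and accuracy) is omitted here for brevity.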
Proposed code samples:
Test code for checking horovod initialization
ULHPC Tensorflow/Keras code example
Official Horovod code examples
Testing multi-node multi-GPU Horovod
For testing large-scale training, we launch test_horovod.py on 2 nodes with 3 GPUs per node (6 MPI processes in total).
Script multinode-multigpu-test.sh:
#!/bin/sh -l
#SBATCH -c 2 # 2 CPU cores per process
#SBATCH -N 2 # 2 nodes
#SBATCH -p gpu # GPU partition
#SBATCH --gpus-per-node 3 # each process will see 3 GPUs
#SBATCH -t 30 # 30-minute walltime
#SBATCH --export=ALL
mpirun -n 6 python test_horovod.py # 6 processes = 2 nodes x 3 GPUs per node
sbatch multinode-multigpu-test.sh
The SLURM job output should look like:
List of TF visible physical GPUs : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')]
MPI_size = 6, MPI_rank = 0, MPI_local_size = 3, MPI_local_rank = 0 platform = iris-179
List of TF visible physical GPUs : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')]
MPI_size = 6, MPI_rank = 1, MPI_local_size = 3, MPI_local_rank = 0 platform = iris-180
List of TF visible physical GPUs : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')]
MPI_size = 6, MPI_rank = 2, MPI_local_size = 3, MPI_local_rank = 1 platform = iris-179
List of TF visible physical GPUs : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')]
MPI_size = 6, MPI_rank = 4, MPI_local_size = 3, MPI_local_rank = 2 platform = iris-179
List of TF visible physical GPUs : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')]
MPI_size = 6, MPI_rank = 3, MPI_local_size = 3, MPI_local_rank = 1 platform = iris-180
List of TF visible physical GPUs : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')]
MPI_size = 6, MPI_rank = 5, MPI_local_size = 3, MPI_local_rank = 2 platform = iris-180
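The exact contents of test_horovod.py ship with the tutorial; a hypothetical minimal equivalent that prints what each rank sees could look like:
import platform
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Each rank prints the GPUs it can see and its position in the MPI topology.
print("List of TF visible physical GPUs :", tf.config.list_physical_devices('GPU'))
print(f"MPI_size = {hvd.size()}, MPI_rank = {hvd.rank()}, "
      f"MPI_local_size = {hvd.local_size()}, MPI_local_rank = {hvd.local_rank()} "
      f"platform = {platform.node()}")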
Examples
TensorFlow2
Running the TensorFlow2 code with 1 GPU:
$ mpirun -n 1 python tensorflow_horovod.py
[...]
Epoch 4/4
195/195 - 147s - loss: 0.1119 - 147s/epoch - 753ms/step
Loss: 1.7116930484771729 accuracy: 0.4459134638309479
Running the TensorFlow2 code with 2 GPUs:
$ mpirun -n 2 python tensorflow_horovod.py
[...]
Epoch 4/4
Epoch 4/4
195/195 - 92s - loss: 0.1164 - 92s/epoch - 472ms/step
195/195 - 92s - loss: 0.1173 - 92s/epoch - 469ms/step
Loss: 1.3920882940292358 accuracy: 0.5380609035491943
Loss: 1.3958407640457153 accuracy: 0.5354567170143127
The training time improves with 2 GPUs compared to 1: about 92 s per epoch instead of 147 s, roughly a 1.6x speedup.
PyTorch2
Running PyTorch2 code with 1 GPU:
$ mpirun -n 1 python pytorch_horovod.py
[...]
Epoch: 4 141 sec.
Loss: -0.7153724431991577 accuracy: 0.7164999842643738
Running PyTorch2 code with 2 GPUs:
$ mpirun -n 2 python pytorch_horovod.py
[...]
Epoch: 4 85 sec.
Loss: -0.6600856781005859 accuracy: 0.6620000004768372
Epoch: 4 85 sec.
Loss: -0.6600856781005859 accuracy: 0.6620000004768372
The prediction quality remains similar (around 70% +/- 4%), while training with 2 GPUs is about 1.65 times faster.
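For reference, a minimal sketch of how pytorch_horovod.py might wire Horovod into a standard PyTorch training loop (hypothetical: the model, data, and hyperparameters below are placeholders; the real script is in the tutorial repository):
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())      # pin one GPU per worker

# Placeholder model and synthetic data, only to illustrate the wiring.
model = torch.nn.Linear(32, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Average gradients across workers and start from identical weights.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

dataset = torch.utils.data.TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(4):
    sampler.set_epoch(epoch)                 # reshuffle shards each epoch
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x_batch.cuda()), y_batch.cuda())
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:
        print(f"Epoch {epoch}: last loss = {loss.item():.4f}")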
Going further towards scalability
A bigger batch reduces the communication need relative to computation. If you are facing scalability issues, increase the batch size.
- Large Batch Sizes (LBS), such as >1024, may hurt convergence. To mitigate this:
- Learning rate scheduling. This can help compensate for the challenges posed by larger batch sizes and aid convergence (see the sketch after this list).
- The Adam optimizer often gives better experimental results than SGD; its adaptive nature can help alleviate some of the convergence issues associated with LBS. See this blog post: https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e
- Adapting the neural network architecture for scalability. For example, some suggest that wider models can scale better: L. Chen et al., 2018, https://proceedings.neurips.cc/paper/2018/file/e7c573c14a09b84f6b7782ce3965f335-Paper.pdf
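A hedged sketch of how learning rate warmup and plateau-based scheduling could be combined with Horovod's Keras callbacks (for recent Horovod versions; the base learning rate and epoch counts are assumptions):
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
base_lr = 0.01                      # assumed single-worker learning rate
scaled_lr = base_lr * hvd.size()    # linear scaling with the number of workers

callbacks = [
    # Keep all workers in sync at the start of training.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Average validation metrics across workers before acting on them.
    hvd.callbacks.MetricAverageCallback(),
    # Ramp the learning rate up to scaled_lr over the first epochs.
    hvd.callbacks.LearningRateWarmupCallback(initial_lr=scaled_lr, warmup_epochs=3, verbose=1),
    # Standard Keras plateau detection on the (averaged) validation loss.
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2),
]
# Pass callbacks=callbacks to model.fit(...) as in the earlier sketch.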
Appendices
Appendix 1: Install Horovod yourself
Before installing Horovod you need its dependencies: MPI, CUDA, cuDNN, and NCCL. All of them require matching versions :)
Some software is already installed to help you in the Horovod installation quest.
The provided script sets up the environment variables required for installing Horovod's dependencies.
HPCAI_ROOT="/work/projects/ulhpc-tutorials/PS10-Horovod/"
# MPI, CUDA, and compilers
module load toolchain/intelcuda
# CUDNN
export CUDNN_PATH=${HPCAI_ROOT}/soft/cudnn/install/cudnn-linux-x86_64-8.8.1.3_cuda11-archive
export LD_LIBRARY_PATH=${CUDNN_PATH}/lib/:${LD_LIBRARY_PATH}
export CPATH=${CUDNN_PATH}/include/:${CPATH}
# NCCL
export NCCL_DEBUG=INFO # allows checking at runtime whether NCCL is active
# NCCL_ROOT=${HPCAI_ROOT}/miniconda/install/miniconda/lib/python3.10/site-packages/nvidia/nccl/ # nccl is also present there
export NCCL_ROOT=/work/projects/ulhpc-tutorials/PS10-Horovod/soft/nccl/install/nccl_2.17.1-1+cuda11.0_x86_64/
export NCCL_INCLUDE_DIR=${NCCL_ROOT}/include/
export NCCL_LIBRARY=${NCCL_ROOT}/lib/
export LD_LIBRARY_PATH=${NCCL_ROOT}/lib/:${LD_LIBRARY_PATH}
export CPATH=${NCCL_ROOT}/include/:${CPATH}
export HOROVOD_NCCL_HOME=${NCCL_ROOT}
export HOROVOD_NCCL_INCLUDE=${NCCL_ROOT}/include/
export HOROVOD_NCCL_LIB=${NCCL_ROOT}/lib/
# HOROVOD build flags (exported so pip sees them when building Horovod)
export HOROVOD_GPU_OPERATIONS=NCCL
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_WITH_PYTORCH=1
export HOROVOD_WITH_MPI=1
# GIVE XLA CUDA PATH
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/opt/apps/resif/iris-rhel8/2020b/gpu/software/CUDAcore/11.1.1/
Now let's install Horovod with NCCL support (a single command):
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_INCLUDE=$HOROVOD_NCCL_INCLUDE HOROVOD_NCCL_LIB=$HOROVOD_NCCL_LIB pip install --no-cache-dir --force-reinstall horovod
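In addition to horovodrun --check-build below, the build flags can also be queried from Python (a quick sanity check, assuming the same environment is active):
# Verify that the freshly installed Horovod was built with MPI and NCCL support.
import horovod.tensorflow as hvd  # or horovod.torch
print("MPI built: ", hvd.mpi_built())
print("NCCL built:", hvd.nccl_built())
print("Gloo built:", hvd.gloo_built())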
Checking Horovod
The following command ensures that you can now use Horovod:
horovodrun --check-build
The command should produce the following output:
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
Ensure that the deep learning framework you wish to use is checked, along with MPI and NCCL.