Author: Valentin Plugaru Valentin.Plugaru@uni.lu Copyright (c) 2015-2017 UL HPC Team hpc-sysadmins@uni.lu
# UL HPC MPI Tutorial: High Performance Conjugate Gradients (HPCG) benchmarking on the UL HPC platform
The objective of this tutorial is to compile and run one of the newest HPC benchmarks, High Performance Conjugate Gradients (HPCG), on top of the UL HPC platform.
You can work in groups for this training, yet individual work is encouraged to ensure you understand and practice the usage of MPI programs on an HPC platform. If not yet done, you should consider completing the OSU Micro-benchmark and HPL tutorials.
In all cases, ensure you are able to [connect to the UL HPC clusters](https://hpc-docs.uni.lu/connect/access/).
```bash
# /!\ FOR ALL YOUR COMPILING BUSINESS, ENSURE YOU WORK ON A (at least half) COMPUTING NODE
# Have an interactive job
(access)$> si -n 14                                          # iris
(access)$> salloc -p interactive --qos debug -n 14           # iris (long version)
(access)$> oarsub -I -l enclosure=1/nodes=1,walltime=4       # chaos / gaia
```
**Advanced users only**: rely on `screen` (see the corresponding UL HPC tutorial) on the frontend prior to running any `oarsub` or `srun`/`sbatch` command, to be more resilient to disconnections.
The latest version of this tutorial is available on GitHub. Finally, advanced MPI users might be interested in taking a look at the Intel Math Kernel Library Link Line Advisor.
## Objectives
The High Performance Conjugate Gradients (HPCG) project is an effort to create a more relevant metric for ranking HPC systems than the High Performance LINPACK (HPL) benchmark currently used in the Top500 ranking.
HPCG exhibits the following patterns:

* Dense and sparse computations
* Dense and sparse collective operations
* Data-driven parallelism (unstructured sparse triangular solves)
For more details, check out:

* Toward a New Metric for Ranking High Performance Computing Systems
* The HPCG Technical specification
HPCG is written in C++ with OpenMP and MPI parallelization capabilities, and thus requires a C++ compiler with OpenMP support and/or an MPI library.
The objective of this practical session is to compare the performance obtained by running HPCG compiled with different compilers and options:
- HPCG + Intel C++ + Intel MPI
    - architecture native build, using the most recent supported instruction set (AVX2/FMA3)
    - SSE4.1 instruction set build
- HPCG + GNU C++ + Open MPI
    - architecture native build, using the most recent supported instruction set (AVX2/FMA3)
    - SSE4.1 instruction set build
The benchmarking tests should be performed on:
- a single node
- two nodes, ideally belonging to the same enclosure
- two nodes, belonging to different enclosures
## Executions on a single node
### High Performance Conjugate Gradients (HPCG) with the Intel Suite
We are first going to use the Intel Cluster Toolkit Compiler Edition, which provides Intel C/C++ and Fortran compilers, Intel MPI & Intel MKL.
Resources: the HPCG project website, http://www.hpcg-benchmark.org/
Get the latest release:
```bash
$> mkdir ~/TP && cd ~/TP
$> wget http://www.hpcg-benchmark.org/downloads/hpcg-3.0.tar.gz
$> tar xvzf hpcg-3.0.tar.gz
$> cd hpcg-3.0
$> module avail MPI
$> module load toolchain/intel
$> module list
Currently Loaded Modules:
  1) compiler/GCCcore/6.3.0                    4) compiler/ifort/2017.1.132-GCC-6.3.0-2.27                  7) toolchain/iimpi/2017a
  2) tools/binutils/2.27-GCCcore-6.3.0         5) toolchain/iccifort/2017.1.132-GCC-6.3.0-2.27              8) numlib/imkl/2017.1.132-iimpi-2017a
  3) compiler/icc/2017.1.132-GCC-6.3.0-2.27    6) mpi/impi/2017.1.132-iccifort-2017.1.132-GCC-6.3.0-2.27    9) toolchain/intel/2017a
$> module show mpi/impi/2017.1.132-iccifort-2017.1.132-GCC-6.3.0-2.27
```
Read the `INSTALL` file. In particular, you'll have to create a new makefile `Make.intel64` (typically inspired from `setup/Make.MPI_ICPC`), adapting:

- the `CXX` variable, specifying the C++ compiler (use `mpiicpc`, the Intel MPI C++ wrapper)
- the `CXXFLAGS` variable, with architecture-specific compilation flags (see this Intel article)
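As a hedged illustration, the two edits could look as follows; the `sed` commands and the `-xHost` flag (which targets the instruction set of the build host) are only an assumption of how `setup/Make.MPI_ICPC` is laid out, so double-check the resulting file:

```bash
# Sketch only: derive Make.intel64 from the provided Make.MPI_ICPC template
$> cd ~/TP/hpcg-3.0
$> cp setup/Make.MPI_ICPC setup/Make.intel64
# Use the Intel MPI C++ wrapper and host-native optimizations (assumed variable layout)
$> sed -i 's/^CXX *=.*/CXX = mpiicpc/' setup/Make.intel64
$> sed -i 's/^CXXFLAGS *=.*/CXXFLAGS = $(HPCG_DEFS) -O3 -xHost/' setup/Make.intel64
```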
Once the configuration file is prepared, run the compilation with:

```bash
$> make arch=intel64
```
Once compiled, ensure that you are able to run it:

```bash
$> cd bin
$> cat hpcg.dat
$> mkdir intel64-optimized
$> mv xhpcg intel64-optimized
$> cd intel64-optimized
$> ln -s ../hpcg.dat .
$> mpirun -hostfile $OAR_NODEFILE ./xhpcg
```
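Note that `$OAR_NODEFILE` only exists inside OAR jobs (chaos/gaia). Within a Slurm job on iris, a plain `srun` launch is the usual equivalent; the sketch below assumes the `toolchain/intel` module is still loaded:

```bash
# Sketch: launching inside a Slurm allocation on iris
$> srun ./xhpcg                      # one MPI rank per allocated task
# or, using Intel MPI's own launcher with the allocated task count:
$> mpirun -np $SLURM_NTASKS ./xhpcg
```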
As configured in the default `hpcg.dat`, HPCG generates a synthetic discretized three-dimensional partial differential equation model problem with `Nx = Ny = Nz = 104` local subgrid dimensions. `NPx`, `NPy`, `NPz` are a factoring of the MPI process space, giving a global domain dimension of `(Nx * NPx) * (Ny * NPy) * (Nz * NPz)`.
You can tune `Nx`, `Ny`, `Nz` to increase or decrease the problem size, but take care not to generate a problem whose local grid representation exceeds the memory of a computing node.
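For reference, `hpcg.dat` is a small four-line file: two title/comment lines, the local grid dimensions, and the requested run time in seconds. Below is a hedged example that raises the local grid to 128^3; the values are purely illustrative, so check that they fit in the memory of your nodes:

```bash
$> cat hpcg.dat
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
128 128 128
60
```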
The results of your experiments will be stored in the directory HPCG was started in, in a `HPCG-Benchmark-3.0_$(date).yaml` file (the exact name follows the HPCG version you built). Check out the benchmark result (GFLOP/s) in the final summary section:

```bash
$> grep "HPCG result is" $file.yaml
```
In addition to the architecture-optimized build, re-generate `xhpcg` with the compiler options to support only the SSE4.1 instruction set (common across all UL HPC computing nodes) and perform the same experiment, in a new `intel64-generic` directory.
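One possible way of doing so is sketched below; the `Make.intel64_generic` name and the `-xSSE4.1` flag are assumptions (and the `sed` edit presumes the `Make.intel64` sketch above used `-xHost`), so compare against the Intel article referenced earlier:

```bash
# Sketch: SSE4.1-only rebuild with the Intel suite
$> cd ~/TP/hpcg-3.0
$> cp setup/Make.intel64 setup/Make.intel64_generic
$> sed -i 's/-xHost/-xSSE4.1/' setup/Make.intel64_generic
$> make arch=intel64_generic clean
$> make arch=intel64_generic
# assuming the fresh xhpcg again lands in bin/
$> mkdir bin/intel64-generic && mv bin/xhpcg bin/intel64-generic/
$> cd bin/intel64-generic && ln -s ../hpcg.dat . && mpirun -hostfile $OAR_NODEFILE ./xhpcg
```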
### HPCG with GNU C++ and Open MPI
Re-compile HPCG with GNU C++, adapting the setup file `Make.gcc` from `Make.Linux_MPI` to use the `mpicxx` wrapper and the GCC-specific architecture options.
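A hedged sketch of those adaptations is shown below; the `-march=native` flag targets the build host for the optimized build (use e.g. `-msse4.1` instead for the SSE4.1-only variant), and the `sed` edits assume the usual layout of `setup/Make.Linux_MPI`:

```bash
# Sketch only: derive Make.gcc from the generic Linux MPI template
$> cd ~/TP/hpcg-3.0
$> cp setup/Make.Linux_MPI setup/Make.gcc
$> sed -i 's/^CXX *=.*/CXX = mpicxx/' setup/Make.gcc
$> sed -i 's/^CXXFLAGS *=.*/CXXFLAGS = $(HPCG_DEFS) -O3 -march=native/' setup/Make.gcc
```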
```bash
$> cd ~/TP/hpcg-3.0
$> make clean
$> module purge
$> module load mpi/OpenMPI
$> make arch=gcc
```
Once compiled, ensure you are able to run it:

```bash
$> cd bin
$> cat hpcg.dat
$> mkdir gnu-optimized
$> mv xhpcg gnu-optimized
$> cd gnu-optimized
$> ln -s ../hpcg.dat .
$> mpirun -x PATH -x LD_LIBRARY_PATH -hostfile $OAR_NODEFILE ./xhpcg
```
## Benchmarking on two nodes
Restart the benchmarking campaign (for both the Intel and GCC builds) in the following contexts:

* 2 nodes belonging to the same enclosure. Use for that:

  ```bash
  $> oarsub -l enclosure=1/nodes=2,walltime=1 […]
  ```

* 2 nodes belonging to different enclosures:

  ```bash
  $> oarsub -l enclosure=2/nodes=1,walltime=1 […]
  ```
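On iris (Slurm), enclosures are not exposed the same way; a hedged equivalent for a plain two-node run is sketched below, assuming 28-core nodes and the same interactive partition/QOS used earlier:

```bash
# Sketch: interactive 2-node allocation on iris
$> salloc -p interactive --qos debug -N 2 --ntasks-per-node 28
```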
## Benchmarking with OpenMP active
Finally, activate OpenMP support when building HPCG, adapting the setup files `Make.MPI_ICPC_OMP` and `Make.MPI_GCC_OMP` for the Intel and GCC suites respectively.
As before, perform single and multiple node benchmarks.
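For the hybrid runs, you typically start fewer MPI ranks per node and let OpenMP threads fill the remaining cores. The sketch below (for the Open MPI build) is only an example; the 2 ranks x 7 threads split assumes half of a 14-core socket per rank, so adapt it to your nodes:

```bash
# Sketch: hybrid MPI+OpenMP run with the Open MPI build
$> export OMP_NUM_THREADS=7
$> mpirun -x OMP_NUM_THREADS -x PATH -x LD_LIBRARY_PATH -hostfile $OAR_NODEFILE -npernode 2 ./xhpcg
```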
How do the performance results of the hybrid OpenMP+MPI executions compare to the MPI-only ones?