Author: Valentin Plugaru Valentin.Plugaru@uni.lu Copyright (c) 2015-2017 UL HPC Team hpc-sysadmins@uni.lu
# UL HPC MPI Tutorial: High Performance Conjugate Gradients (HPCG) benchmarking on the UL HPC platform
The objective of this tutorial is to compile and run one of the newest HPC benchmarks, High Performance Conjugate Gradients (HPCG), on top of the UL HPC platform.
You can work in groups for this training, yet individual work is encouraged to ensure you understand and practice the usage of MPI programs on an HPC platform. If not yet done, you should consider completing the OSU Micro-benchmark and HPL tutorials.
In all cases, ensure you are able to [connect to the UL HPC clusters](https://hpc-docs.uni.lu/connect/access/).
```bash
# /!\ FOR ALL YOUR COMPILING BUSINESS, ENSURE YOU WORK ON A (at least half) COMPUTING NODE
# Have an interactive job
(access)$> si -n 14                                          # iris
(access)$> salloc -p interactive --qos debug -n 14           # iris (long version)
(access)$> oarsub -I -l enclosure=1/nodes=1,walltime=4       # chaos / gaia
```
**Advanced users only**: rely on `screen` (see the corresponding UL HPC tutorial) on the frontend prior to running any `oarsub` or `srun`/`sbatch` command, to be more resilient to disconnections.
The latest version of this tutorial is available on GitHub. Finally, advanced MPI users might be interested in taking a look at the Intel Math Kernel Library Link Line Advisor.
## Objectives
The High Performance Conjugate Gradients (HPCG) project is an effort to create a more relevant metric for ranking HPC systems than the High Performance LINPACK (HPL) benchmark currently used in the Top500 ranking.
HPCG exhibits the following patterns:

* Dense and sparse computations
* Dense and sparse collective operations
* Data-driven parallelism (unstructured sparse triangular solves)
For more details, check out:

* Toward a New Metric for Ranking High Performance Computing Systems
* The HPCG Technical specification
HPCG is written in C++ with OpenMP and MPI parallelization capabilities, and thus requires a C++ compiler with OpenMP support and/or an MPI library.
The objective of this practical session is to compare the performance obtained by running HPCG compiled with different compilers and options:
- HPCG + Intel C++ + Intel MPI
    - architecture native build, using the most recent supported instruction set (AVX2/FMA3)
    - SSE4.1 instruction set build
- HPCG + GNU C++ + Open MPI
    - architecture native build, using the most recent supported instruction set (AVX2/FMA3)
    - SSE4.1 instruction set build
The benchmarking tests should be performed on:
- a single node
- two nodes, ideally belonging to the same enclosure
- two nodes, belonging to different enclosures
## Executions on a single node
### High Performance Conjugate Gradients (HPCG) with the Intel Suite
We are first going to use the Intel Cluster Toolkit Compiler Edition, which provides Intel C/C++ and Fortran compilers, Intel MPI & Intel MKL.
Resources: the HPCG project website, http://www.hpcg-benchmark.org/
Get the latest release:
```bash
$> mkdir ~/TP && cd ~/TP
$> wget http://www.hpcg-benchmark.org/downloads/hpcg-3.0.tar.gz
$> tar xvzf hpcg-3.0.tar.gz
$> cd hpcg-3.0
$> module avail MPI
$> module load toolchain/intel
$> module list
Currently Loaded Modules:
  1) compiler/GCCcore/6.3.0                    4) compiler/ifort/2017.1.132-GCC-6.3.0-2.27                  7) toolchain/iimpi/2017a
  2) tools/binutils/2.27-GCCcore-6.3.0         5) toolchain/iccifort/2017.1.132-GCC-6.3.0-2.27              8) numlib/imkl/2017.1.132-iimpi-2017a
  3) compiler/icc/2017.1.132-GCC-6.3.0-2.27    6) mpi/impi/2017.1.132-iccifort-2017.1.132-GCC-6.3.0-2.27    9) toolchain/intel/2017a
$> module show mpi/impi/2017.1.132-iccifort-2017.1.132-GCC-6.3.0-2.27
```
Read the `INSTALL` file. In particular, you'll have to create a new makefile `Make.intel64` (typically inspired from `setup/Make.MPI_ICPC`), adapting:

- the `CXX` variable, specifying the C++ compiler (use `mpiicpc`, the Intel MPI C++ wrapper)
- the `CXXFLAGS` variable, with architecture-specific compilation flags (see this Intel article)
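As a hedged illustration, the two edits could look as follows; the `sed` commands and the `-xHost` flag (which targets the instruction set of the build host) are only an assumption of how `setup/Make.MPI_ICPC` is laid out, so double-check the resulting file:

```bash
# Sketch only: derive Make.intel64 from the provided Make.MPI_ICPC template
$> cd ~/TP/hpcg-3.0
$> cp setup/Make.MPI_ICPC setup/Make.intel64
# Use the Intel MPI C++ wrapper and host-native optimizations (assumed variable layout)
$> sed -i 's/^CXX *=.*/CXX = mpiicpc/' setup/Make.intel64
$> sed -i 's/^CXXFLAGS *=.*/CXXFLAGS = $(HPCG_DEFS) -O3 -xHost/' setup/Make.intel64
```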
Once the configuration file is prepared, run the compilation with:

```bash
$> make arch=intel64
```
Once compiled, ensure that you are able to run it:

```bash
$> cd bin
$> cat hpcg.dat
$> mkdir intel64-optimized
$> mv xhpcg intel64-optimized
$> cd intel64-optimized
$> ln -s ../hpcg.dat .
$> mpirun -hostfile $OAR_NODEFILE ./xhpcg
```
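Note that `$OAR_NODEFILE` only exists inside OAR jobs (chaos/gaia). Within a Slurm job on iris, a plain `srun` launch is the usual equivalent; the sketch below assumes the `toolchain/intel` module is still loaded:

```bash
# Sketch: launching inside a Slurm allocation on iris
$> srun ./xhpcg                      # one MPI rank per allocated task
# or, using Intel MPI's own launcher with the allocated task count:
$> mpirun -np $SLURM_NTASKS ./xhpcg
```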
As configured in the default `hpcg.dat`, HPCG generates a synthetic discretized three-dimensional partial differential equation model problem with `Nx = Ny = Nz = 104` local subgrid dimensions. `NPx`, `NPy`, `NPz` are a factoring of the MPI process space, giving a global domain dimension of `(Nx * NPx) * (Ny * NPy) * (Nz * NPz)`.
You can tune `Nx`, `Ny`, `Nz` to increase or decrease the problem size, but take care not to generate a problem whose local grid representation exceeds the memory of a computing node.
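For reference, `hpcg.dat` is a small four-line file: two title/comment lines, the local grid dimensions, and the requested run time in seconds. Below is a hedged example that raises the local grid to 128^3; the values are purely illustrative, so check that they fit in the memory of your nodes:

```bash
$> cat hpcg.dat
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
128 128 128
60
```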
The results of your experiments will be stored in the directory HPCG was started in, in a `HPCG-Benchmark-3.0_$(date).yaml` file (the exact name follows the HPCG version you built). Check out the benchmark result (GFLOP/s) in the final summary section:

```bash
$> grep "HPCG result is" $file.yaml
```
In addition to the architecture-optimized build, re-generate `xhpcg` with the compiler options to support only the SSE4.1 instruction set (common across all UL HPC computing nodes) and perform the same experiment, in a new `intel64-generic` directory.
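One possible way of doing so is sketched below; the `Make.intel64_generic` name and the `-xSSE4.1` flag are assumptions (and the `sed` edit presumes the `Make.intel64` sketch above used `-xHost`), so compare against the Intel article referenced earlier:

```bash
# Sketch: SSE4.1-only rebuild with the Intel suite
$> cd ~/TP/hpcg-3.0
$> cp setup/Make.intel64 setup/Make.intel64_generic
$> sed -i 's/-xHost/-xSSE4.1/' setup/Make.intel64_generic
$> make arch=intel64_generic clean
$> make arch=intel64_generic
# assuming the fresh xhpcg again lands in bin/
$> mkdir bin/intel64-generic && mv bin/xhpcg bin/intel64-generic/
$> cd bin/intel64-generic && ln -s ../hpcg.dat . && mpirun -hostfile $OAR_NODEFILE ./xhpcg
```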
### HPCG with GNU C++ and Open MPI
Re-compile HPCG with GNU C++, adapting the setup file `Make.gcc` from `Make.Linux_MPI` to use the `mpicxx` wrapper and the GCC-specific architecture options.
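A hedged sketch of those adaptations is shown below; the `-march=native` flag targets the build host for the optimized build (use e.g. `-msse4.1` instead for the SSE4.1-only variant), and the `sed` edits assume the usual layout of `setup/Make.Linux_MPI`:

```bash
# Sketch only: derive Make.gcc from the generic Linux MPI template
$> cd ~/TP/hpcg-3.0
$> cp setup/Make.Linux_MPI setup/Make.gcc
$> sed -i 's/^CXX *=.*/CXX = mpicxx/' setup/Make.gcc
$> sed -i 's/^CXXFLAGS *=.*/CXXFLAGS = $(HPCG_DEFS) -O3 -march=native/' setup/Make.gcc
```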
```bash
$> cd ~/TP/hpcg-3.0
$> make clean
$> module purge
$> module load mpi/OpenMPI
$> make arch=gcc
```
Once compiled, ensure you are able to run it:

```bash
$> cd bin
$> cat hpcg.dat
$> mkdir gnu-optimized
$> mv xhpcg gnu-optimized
$> cd gnu-optimized
$> ln -s ../hpcg.dat .
$> mpirun -x PATH -x LD_LIBRARY_PATH -hostfile $OAR_NODEFILE ./xhpcg
```
## Benchmarking on two nodes
Restart the benchmarking campaign (for both the Intel and GCC builds) in the following contexts:

* 2 nodes belonging to the same enclosure. Use for that:

  ```bash
  $> oarsub -l enclosure=1/nodes=2,walltime=1 […]
  ```

* 2 nodes belonging to different enclosures:

  ```bash
  $> oarsub -l enclosure=2/nodes=1,walltime=1 […]
  ```
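On iris (Slurm), enclosures are not exposed the same way; a hedged equivalent for a plain two-node run is sketched below, assuming 28-core nodes and the same interactive partition/QOS used earlier:

```bash
# Sketch: interactive 2-node allocation on iris
$> salloc -p interactive --qos debug -N 2 --ntasks-per-node 28
```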
## Benchmarking with OpenMP active
Finally, activate OpenMP support when building HPCG, adapting the setup files `Make.MPI_ICPC_OMP` and `Make.MPI_GCC_OMP` for the Intel and GCC suites respectively.
As before, perform single and multiple node benchmarks.
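For the hybrid runs, you typically start fewer MPI ranks per node and let OpenMP threads fill the remaining cores. The sketch below (for the Open MPI build) is only an example; the 2 ranks x 7 threads split assumes half of a 14-core socket per rank, so adapt it to your nodes:

```bash
# Sketch: hybrid MPI+OpenMP run with the Open MPI build
$> export OMP_NUM_THREADS=7
$> mpirun -x OMP_NUM_THREADS -x PATH -x LD_LIBRARY_PATH -hostfile $OAR_NODEFILE -npernode 2 ./xhpcg
```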
How do the performance results of the hybrid OpenMP+MPI executions compare to the MPI-only ones?