High-Performance Linpack (HPL) benchmarking on UL HPC platform
Copyright (c) 2013-2021 UL HPC Team <hpc-sysadmins@uni.lu>
The objective of this tutorial is to compile and run one of the reference HPC benchmarks, HPL, on top of the UL HPC platform. Kindly ensure you have followed the "Scalable Science and Parallel computations with OpenMP/MPI" tutorial first.
The latest version of this tutorial is available on Github and on http://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/HPL/
Resources
- Tweak HPL parameters
- HPL Calculator to find good parameters and expected performances
- Intel Math Kernel Library Link Line Advisor
Pre-requisites
If not yet done, pull the latest changes in your working copy of the ULHPC/tutorials repository, which you should have cloned in ~/git/github.com/ULHPC/tutorials (see the "preliminaries" tutorial):
(access)$ cd ~/git/github.com/ULHPC/tutorials
(access)$ git pull
Now configure a dedicated directory ~/tutorials/HPL for this session:
(access)$ mkdir -p ~/tutorials/HPL
(access)$ cd ~/tutorials/HPL
# create a symbolic link to the top reference material
(access)$ ln -s ~/git/github.com/ULHPC/tutorials/parallel/mpi/HPL ref.d
# create other convenient symlinks
(access)$ ln -s ref.d/Makefile . # symlink to the root Makefile
Advanced users (optional yet strongly recommended): create a Tmux session (see the Tmux cheat sheet and tutorial) or a GNU Screen session that you can recover later. See also the "Getting Started" tutorial.
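For instance, a minimal tmux workflow could look as follows (the session name `hpl` is just an example):

```bash
# Start a named session on the access server
tmux new -s hpl
# ... work inside the session, detach at any time with CTRL-b d ...
# Re-attach later, e.g. after a dropped SSH connection
tmux attach -t hpl
```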
Theoretical Peak Performance Rpeak
The ULHPC computing nodes on aion or iris feature the following types of processors (see also /etc/motd on the access node):
Cluster | Vendor | Model | #cores | TDP | Freq. | AVX512 Freq. | Nodes | Rpeak | Rpeak |
---|---|---|---|---|---|---|---|---|---|
Aion | AMD | AMD Epyc ROME 7H12 | 64 | 280W | 2.6 GHz | n/a | aion-[0001-0318] | 2.66 TF | 2.13 TF |
Iris | Intel | Xeon E5-2680v4 (broadwell) | 14 | 120W | 2.4 GHz | n/a | iris-[001-108] | 0.538 TF | 0.46 TF |
Iris | Intel | Xeon Gold 6132 (skylake) | 14 | 140W | 2.6 GHz | 2.3 GHz | iris-[109-186,191-196] | 1.03 TF | 0.88 TF |
Iris | Intel | Xeon Platinum 8180M (skylake) | 28 | 205W | 2.5 GHz | 2.3 GHz | iris-[187-190] | 2.06 TF | 1.75 TF |
Computing the theoretical peak performance of these processors is done using the following formula:

    Rpeak = #Cores x [AVX-512 all-core turbo] Frequency x #DP_ops_per_cycle
Knowing that:

- Broadwell processors (`iris-[001-108]` nodes) perform 16 DP ops/cycle and support AVX2/FMA3.
- Skylake processors (`iris-[109-196]` nodes) belong to the Gold or Platinum family and thus have two AVX-512 units, making them capable of performing 32 Double Precision (DP) flops/cycle. From the reference Intel documentation, it is possible to extract for the featured models the AVX-512 turbo frequency (i.e., the maximum all-core frequency in turbo mode), which can be used in place of the base non-AVX core frequency to compute the peak performance (see Fig. 3 p.14).
- AMD Epyc (Rome) processors perform 16 Double Precision (DP) ops/cycle.
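As a quick sanity check, the per-processor Rpeak values of the table above can be recomputed with `bc` from this formula:

```bash
# One Aion AMD Epyc 7H12 processor: 64 cores x 2.6 GHz x 16 DP ops/cycle
echo 'scale=2; 64 * 2.6 * 16 / 1000' | bc    # => 2.66 TFlops
# One Iris Xeon Gold 6132 (skylake): 14 cores x 2.3 GHz (AVX-512 all-core turbo) x 32 DP flops/cycle
echo 'scale=2; 14 * 2.3 * 32 / 1000' | bc    # => 1.03 TFlops
```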
HPL permits measuring the effective Rmax performance (as opposed to the above theoretical peak performance Rpeak). The ratio Rmax/Rpeak corresponds to the HPL efficiency.
Objectives
HPL is a portable implementation of the High-Performance Linpack (HPL) Benchmark for Distributed-Memory Computers. It is used as the reference benchmark to provide data for the Top500 list and thus to rank supercomputers worldwide. HPL relies on an efficient implementation of the Basic Linear Algebra Subprograms (BLAS); you have several choices at this level, and the idea is to compare the different MPI and BLAS implementations.
For the sake of time and simplicity, we will focus on the combination expected to lead to the most performant runs, i.e. the Intel MKL and Intel MPI suite, either in full MPI or in hybrid runs (on 1 or 2 nodes).
As a bonus, a comparison with the reference HPL binary compiled as part of the `toolchain/intel` module will be considered.
Fetching the HPL sources
In the working directory `~/tutorials/HPL`, fetch and uncompress the latest version of the HPL benchmark (i.e. version 2.3 at the time of writing):
$ cd ~/tutorials/HPL
$ mkdir src
# Download the sources
$ cd src
# Download the latest version
$ export HPL_VERSION=2.3
$ wget --no-check-certificate http://www.netlib.org/benchmark/hpl/hpl-${HPL_VERSION}.tar.gz
$ tar xvzf hpl-${HPL_VERSION}.tar.gz
$ cd hpl-${HPL_VERSION}
Alternatively, you can use the following command to fetch and uncompress the HPL sources:
$ cd ~/tutorials/HPL
$ make fetch
$ make uncompress
Building the HPL benchmark
We are first going to use the Intel Cluster Toolkit Compiler Edition, which provides the Intel C/C++ and Fortran compilers as well as Intel MPI.
$ cd ~/tutorials/HPL
# Copy the provided Make.intel64
$ cp ref.d/src/Make.intel64 src/
Now you can reserve an interactive job for the compilation from the access server:
# Quickly get one interactive job for 1h
$ si -N 2 --ntasks-per-node 2
# OR get one interactive job (totalling 2*2 MPI processes) on broadwell-based nodes
$ si -C broadwell -N2 --ntasks-per-node 2
# OR get one interactive job (totalling 2*2 MPI processes) on skylake-based nodes
$ si -C skylake -N2 --ntasks-per-node 2
Now that you are on a computing node, you can load the appropriate module providing the Intel MKL and Intel MPI suite, i.e. `toolchain/intel`:
# Load the appropriate module
$ module load toolchain/intel
$ module list
Intel MKL is now loaded.
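You can also check that the environment variables used later in this tutorial (in `Make.intel64` and for the MKL documentation path) are now defined:

```bash
# Sanity check of the environment exposed by toolchain/intel (exact paths depend on the software set)
echo $EBROOTIMKL     # root of the Intel MKL installation
echo $I_MPI_ROOT     # root of the Intel MPI installation (used for MPdir in Make.intel64)
```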
Read the `INSTALL` file under `src/hpl-2.3`. In particular, you'll have to edit and adapt a new makefile `Make.intel64` (typically inspired by `setup/Make.Linux_Intel64`), which is provided to you on Github for that purpose.
cd src/hpl-2.3
cp ../Make.intel64 .
# OR (if the above command fails)
# cp ~/git/github.com/ULHPC/tutorials/parallel/mpi/HPL/src/Make.intel64 Make.intel64
# Automatically adapt at least the TOPdir variable to the current directory $(pwd),
# thus it SHOULD be run from 'src/hpl-2.3'
sed -i \
-e "s#^[[:space:]]*TOPdir[[:space:]]*=[[:space:]]*.*#TOPdir = $(pwd)#" \
Make.intel64
# Check the difference:
$ diff -ru ../Make.intel64 Make.intel64
--- ../Make.intel64 2019-11-19 23:43:26.668794000 +0100
+++ Make.intel64 2019-11-20 00:33:21.077914972 +0100
@@ -68,7 +68,7 @@
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
-TOPdir = $(HOME)/benchmarks/HPL/src/hpl-2.3
+TOPdir = /home/users/svarrette/tutorials/HPL/src/hpl-2.3
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
In general, to build HPL, you first need to configure the file `Make.intel64` correctly. Take your favorite editor (`vim`, `nano`, etc.) to modify it. In particular, you should:
- Adapt `TOPdir` to point to the directory holding the HPL sources (i.e. where you uncompressed them: `$(HOME)/tutorials/HPL/src/hpl-2.3`) -- this was done using the above `sed` command.
- Adapt the `MP*` variables to point to the appropriate MPI libraries path.
- Correct the OpenMP definitions `OMP_DEFS`.
- (Optionally) adapt the `CCFLAGS` -- in particular, with the Intel compiling suite, you SHOULD at least add `-xHost` to ensure the compilation auto-magically uses the appropriate compilation flags -- see (again) the Intel Math Kernel Library Link Line Advisor.
- (Optionally) adapt the `ARCH` variable.
Here is for instance a suggested difference for Intel MPI:
--- setup/Make.Linux_Intel64 1970-01-01 06:00:00.000000000 +0100
+++ Make.intel64 2019-11-20 00:15:11.938815000 +0100
@@ -61,13 +61,13 @@
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
-ARCH = Linux_Intel64
+ARCH = $(arch)
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
-TOPdir = $(HOME)/hpl
+TOPdir = $(HOME)/tutorials/HPL/src/hpl-2.3
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
@@ -81,9 +81,9 @@
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
-# MPdir = /opt/intel/mpi/4.1.0
-# MPinc = -I$(MPdir)/include64
-# MPlib = $(MPdir)/lib64/libmpi.a
+MPdir = $(I_MPI_ROOT)/intel64
+MPinc = -I$(MPdir)/include
+MPlib = $(MPdir)/lib/libmpi.a
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
@@ -177,9 +178,9 @@
#
CC = mpiicc
CCNOOPT = $(HPL_DEFS)
-OMP_DEFS = -openmp
-CCFLAGS = $(HPL_DEFS) -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk -Wall
-#
+OMP_DEFS = -qopenmp
+CCFLAGS = $(HPL_DEFS) -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk -Wall -xHost
+
#
# On some platforms, it is necessary to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
Once tweaked, run the compilation with:
$> make arch=intel64 clean_arch_all
$> make arch=intel64
If you don't succeed by yourself, use the following Make.intel64.
Once compiled, ensure you are able to run it (you will need at least 4 MPI processes -- for instance with `-N 2 --ntasks-per-node 2`):
$> cd ~/tutorials/HPL/src/hpl-2.3/bin/intel64
$> cat HPL.dat # Default (dummy) HPL.dat input file
# On Slurm cluster, store the output logs into a text file -- see tee
$> srun -n $SLURM_NTASKS ./xhpl | tee test_run.logs
Check the output results with `less test_run.logs`. You can also quickly see the 10 best results obtained by using:
# ================================================================================
# T/V N NB P Q Time Gflops
# --------------------------------------------------------------------------------
$> grep WR test_run.logs | sort -k 7 -n -r | head -n 10
WR00L2L2 29 3 4 1 0.00 9.9834e-03
WR00L2R2 35 2 4 1 0.00 9.9808e-03
WR00R2R4 35 2 4 1 0.00 9.9512e-03
WR00L2C2 30 2 1 4 0.00 9.9436e-03
WR00R2C2 35 2 4 1 0.00 9.9411e-03
WR00R2R2 35 2 4 1 0.00 9.9349e-03
WR00R2R2 30 2 1 4 0.00 9.8879e-03
WR00R2C4 30 2 1 4 0.00 9.8771e-03
WR00C2R2 35 2 4 1 0.00 9.8323e-03
WR00L2C4 29 3 4 1 0.00 9.8049e-03
Alternatively, you can use the build script `scripts/build.HPL` to build the HPL sources on both broadwell and skylake nodes (with the corresponding architectures):
# (if needed) release your previous interactive job to return to the access server
$> exit
$> cd ~/tutorials/HPL
# Create symlink to the scripts directory
$> ln -s ref.d/scripts .
# Create a logs/ directory to store the Slurm logs
$> mkdir logs
# Now submit two building jobs targeting both CPU architectures
$> sbatch -C broadwell ./scripts/build.HPL -n broadwell # Will produce bin/xhpl_broadwell
$> sbatch -C skylake ./scripts/build.HPL -n skylake # Will produce bin/xhpl_skylake
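Once both jobs have completed (check with `squeue -u $USER`), you should find the two architecture-specific binaries announced in the comments above, i.e. `bin/xhpl_broadwell` and `bin/xhpl_skylake`:

```bash
# Check that the build jobs produced the expected binaries
$> ls bin/
# In case of problem, inspect the Slurm logs collected under logs/
$> ls logs/
```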
Preparing batch runs
We are now going to prepare launcher scripts to permit passive runs (typically in the `{default | batch}` queue). We will place them in a separate directory (`runs/`), as it will host the outcomes of the executions on the UL HPC platform.
$> cd ~/tutorials/HPL
$> mkdir -p runs/{broadwell,skylake}/{1N,2N}/{MPI,Hybrid}/ # Prepare the specific run directory
$> cp ref.d/
We are indeed going to run HPL in two different contexts:
- Full MPI, with 1 MPI process per (physical) core reserved.
    - As mentioned in the basic Parallel computations with OpenMP/MPI tutorial, this means that you'll typically reserve the nodes using the `-N <#nodes> --ntasks-per-node 28` options for Slurm, as there are in general 28 cores per node on `iris`.
- Hybrid OpenMP+MPI, with 1 MPI process per CPU socket, and as many OpenMP threads per process as (physical) cores reserved on that socket.
    - As mentioned in the basic Parallel computations with OpenMP/MPI tutorial, this means that you'll typically reserve the nodes using the `-N <#nodes> --ntasks-per-node 2 --ntasks-per-socket 1 -c 14` options for Slurm, as there are in general 2 processors (each with 14 cores) per node on `iris`.
These two contexts will directly affect the values of the HPL parameters `P` and `Q`, since their product should match the total number of MPI processes.
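For instance, the corresponding job submissions on 2 regular iris nodes could look as follows (a sketch assuming the `launcher-HPL.intel.sh` script prepared later in this tutorial, submitted from the `runs/` directory):

```bash
# Full MPI: 2 nodes x 28 cores = 56 MPI processes (1 per physical core)
$> sbatch -N 2 --ntasks-per-node 28 ./launcher-HPL.intel.sh
# Hybrid OpenMP+MPI: 2 nodes x 2 sockets = 4 MPI processes, each with 14 OpenMP threads
$> sbatch -N 2 --ntasks-per-node 2 --ntasks-per-socket 1 -c 14 ./launcher-HPL.intel.sh
```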
HPL main parameters
Running HPL depends on a configuration file `HPL.dat` -- an example is provided in the build directory, i.e. `src/hpl-2.3/bin/intel64/HPL.dat`:
$> cat src/hpl-2.3/bin/intel64/HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
4 # of problems sizes (N)
29 30 34 35 Ns
4 # of NBs
1 2 3 4 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
3 # of process grids (P x Q)
2 1 4 Ps
2 4 1 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
2 4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
See http://www.netlib.org/benchmark/hpl/tuning.html for a description of this file and its parameters (see also the authors' tips).
You can use the following sites for finding the appropriate values:
- Tweak HPL parameters
- HPL Calculator to find good parameters and expected performances
The main parameters to play with for optimizing the HPL runs are:
- `NB`: depends on the CPU architecture; use the recommended blocking sizes (`NB` in `HPL.dat`) listed, after loading the `toolchain/intel` module, under `$EBROOTIMKL/compilers_and_libraries/linux/mkl/benchmarks/mp_linpack/readme.txt`, i.e.
    - `NB=192` for the broadwell processors available on `iris`
    - `NB=384` for the skylake processors available on `iris`
- `P` and `Q`, knowing that the product `P x Q` SHOULD typically be equal to the number of MPI processes.
- Of course `N`, the problem size.
Figure: example of the P-by-Q partitioning of an HPL matrix across 6 processes (2x3 decomposition). (Source)
In order to find out the best performance of your system, the largest problem size fitting in memory is what you should aim for.
Since HPL performs computations on an N x N array of Double Precision (DP) elements, and each double precision element requires `sizeof(double)` = 8 bytes, the memory consumed for a problem size N is 8N^2 bytes.
It follows that `N` can be derived from a simple dimensional analysis based on the volatile memory available to hold these Double Precision elements:

$$ N \simeq \alpha \sqrt{\frac{\text{Total Memory (in bytes)}}{\text{sizeof(double)}}} = \alpha \sqrt{\frac{\#\text{Nodes} \times \text{RAM per node (in bytes)}}{8}} $$

where \alpha is a global ratio normally set to (at least) 80% (best results are typically obtained with \alpha > 92%).
Alternatively, one can target a ratio \beta of the total memory used (for instance 85%), i.e.

$$ N \simeq \sqrt{\beta \times \frac{\text{Total Memory (in bytes)}}{8}} $$

Note that the two ratios you might consider are of course linked: \beta = \alpha^2.
Finally, the problem size should ideally be set to a multiple of the block size `NB`.
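As a back-of-the-envelope check of this formula (a sketch assuming a single node with 128 GiB of RAM, the memory value used for the regular nodes in the `compute_N` examples below) with \alpha = 0.3:

```bash
# N ~ alpha * sqrt(RAM / sizeof(double)) for 1 node with 128 GiB of RAM and alpha = 0.3
echo '0.3 * sqrt(128 * 2^30 / 8)' | bc -l    # => ~39321.6
# Rounded to a multiple of NB, this yields the N values of the table below:
#   205 * 192 = 39360 (broadwell, NB=192)    102 * 384 = 39168 (skylake, NB=384)
```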
Examples of HPL parameters we are going to try (when using regular nodes on the `batch` partition) are proposed in the table below. Note that we purposely use a relatively low value for the ratio \alpha (or \beta), and thus for N, to ensure relatively fast runs within the time of this tutorial.
Architecture | #Node | Mode | MPI proc. | NB | PxQ | \alpha | N |
---|---|---|---|---|---|---|---|
broadwell | 1 | MPI | 28 | 192 | 1x28, 2x14, 4x7 | 0.3 | 39360 |
broadwell | 2 | MPI | 56 | 192 | 1x56, 2x28, 4x14, 7x8 | 0.3 | 55680 |
broadwell | 1 | Hybrid | 2 | 192 | 1x2 | 0.3 | 39360 |
broadwell | 2 | Hybrid | 4 | 192 | 2x2 | 0.3 | 55680 |
skylake | 1 | MPI | 28 | 384 | 1x28, 2x14, 4x7 | 0.3 | 39168 |
skylake | 2 | MPI | 56 | 384 | 1x56, 2x28, 4x14, 7x8 | 0.3 | 55680 |
skylake | 1 | Hybrid | 2 | 384 | 1x2 | 0.3 | 39168 |
skylake | 2 | Hybrid | 4 | 384 | 2x2 | 0.3 | 55680 |
You can use the script `scripts/compute_N` to compute the value of N depending on the global ratio \alpha (using `-r <alpha>`) or \beta (using `-p <beta*100>`):
./scripts/compute_N -h
# 1 Broadwell node, alpha = 0.3
./scripts/compute_N -m 128 -NB 192 -r 0.3 -N 1
# 2 Skylake (regular) nodes, alpha = 0.3
./scripts/compute_N -m 128 -NB 384 -r 0.3 -N 2
# 4 bigmem (skylake) nodes, beta = 0.85
./scripts/compute_N -m 3072 -NB 384 -p 85 -N 4
Using the above values, create the appropriate `HPL.dat` files for each case, under the appropriate directory, i.e. `runs/<arch>/<N>N/`.
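For illustration, here is a sketch generating the file for the 1-node full-MPI broadwell case of the table above (N=39360, NB=192, three 28-process grids 1x28, 2x14 and 4x7), under the assumption that all other tuning lines are kept as in the default `HPL.dat` shown earlier:

```bash
# Hypothetical HPL.dat for runs/broadwell/1N/MPI -- only N, NB, P and Q differ from the default file
cd ~/tutorials/HPL/runs/broadwell/1N/MPI
cat > HPL.dat <<'EOF'
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
39360        Ns
1            # of NBs
192          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
3            # of process grids (P x Q)
1 2 4        Ps
28 14 7      Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
EOF
```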
Slurm launcher (Intel MPI)
Copy and adapt the default MPI SLURM launcher, of which you should have a copy in `~/git/ULHPC/launcher-scripts/slurm/launcher.default.sh`:
$> cd ~/tutorials/HPL/runs
# Prepare a launcher for the Intel suite
$> cp ~/git/github.com/ULHPC/launcher-scripts/slurm/launcher.default.sh launcher-HPL.intel.sh
Take your favorite editor (`vim`, `nano`, etc.) to modify it according to your needs.
Here is for instance a suggested difference for Intel MPI (adapt accordingly):
--- ~/git/ULHPC/launcher-scripts/slurm/launcher.default.sh 2017-06-11 23:40:34.007152000 +0200
+++ launcher-HPL.intel.sh 2017-06-11 23:41:57.597055000 +0200
@@ -10,8 +10,8 @@
#
# Set number of resources
#
-#SBATCH -N 1
+#SBATCH -N 2
#SBATCH --ntasks-per-node=28
### -c, --cpus-per-task=<ncpus>
### (multithreading) Request that ncpus be allocated per process
#SBATCH -c 1
@@ -64,15 +64,15 @@
module load toolchain/intel
# Directory holding your built applications
-APPDIR="$HOME"
+APPDIR="$HOME/tutorials/HPL/src/hpl-2.3/bin/intel64"
# The task to be executed i.E. your favorite Java/C/C++/Ruby/Perl/Python/R/whatever program
# to be invoked in parallel
-TASK="${APPDIR}/app.exe"
+TASK="${APPDIR}/xhpl"
# The command to run
-CMD="${TASK}"
+# CMD="${TASK}"
### General MPI Case:
-# CMD="srun -n $SLURM_NTASKS ${TASK}"
+CMD="srun -n $SLURM_NTASKS ${TASK}"
### OpenMPI case if you wish to specialize the MCA parameters
#CMD="mpirun -np $SLURM_NTASKS --mca btl openib,self,sm ${TASK}"
Now you should create an input `HPL.dat` file within the `runs/<arch>/<N>N/<mode>` directory:
$> cd ~/tutorials/HPL/runs
$> cp ../ref.d/HPL.dat .
$> ll
total 0
-rw-r--r--. 1 svarrette clusterusers 1.5K Jun 12 15:38 HPL.dat
-rwxr-xr-x. 1 svarrette clusterusers 2.7K Jun 12 15:25 launcher-HPL.intel.sh
You are ready for testing a batch job:
$> cd ~/tutorials/HPL/runs
$> sbatch ./launcher-HPL.intel.sh
$> sq # OR (long version) squeue -u $USER
(bonus) Connect to one of the allocated nodes and run `htop` (followed by `u` to select the processes run under your username, and `F5` to enable the tree view).
Now you can check the output of the HPL runs:
$> grep WR slurm-<jobid>.out # /!\ ADAPT <jobid> appropriately.
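As for the interactive test, you can rank the reported Gflops values to spot the best-performing parameter combinations:

```bash
# Keep the 10 best results (Gflops is the 7th field) -- /!\ ADAPT <jobid> appropriately
$> grep WR slurm-<jobid>.out | sort -k 7 -n -r | head -n 10
```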
Of course, this was only a small test: optimizing the HPL parameters to get the best performance and efficiency out of a given HPC platform is not easy.
Below are some plots obtained when benchmarking the `iris` cluster and seeking the best set of parameters across an increasing number of nodes (see this blog post).