High-Performance Linpack (HPL) benchmarking on UL HPC platform
Copyright (c) 2013-2021 UL HPC Team <hpc-sysadmins@uni.lu>
The objective of this tutorial is to compile and run one of the reference HPC benchmarks, HPL, on top of the UL HPC platform. Kindly ensure you have followed the "Scalable Science and Parallel computations with OpenMP/MPI" tutorial first.
The latest version of this tutorial is available on Github and on http://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/HPL/
Resources
- Tweak HPL parameters
- HPL Calculator to find good parameters and expected performances
- Intel Math Kernel Library Link Line Advisor
Pre-requisites
If not yet done, pull the latest changes in your working copy of the ULHPC/tutorials repository, which you should have cloned in ~/git/github.com/ULHPC/tutorials (see the "preliminaries" tutorial):
(access)$ cd ~/git/github.com/ULHPC/tutorials
(access)$ git pull
Now configure a dedicated directory ~/tutorials/HPL for this session:
(access)$ mkdir -p ~/tutorials/HPL
(access)$ cd ~/tutorials/HPL
# create a symbolic link to the top reference material
(access)$ ln -s ~/git/github.com/ULHPC/tutorials/parallel/mpi/HPL ref.d
# create other convenient symlinks
(access)$ ln -s ref.d/Makefile . # symlink to the root Makefile
Advanced users (optional yet strongly recommended): create a Tmux session (see the Tmux cheat sheet and tutorial) or a GNU Screen session that you can recover later. See also the "Getting Started" tutorial.
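For instance, a minimal tmux workflow could look as follows (the session name `hpl` is just an example):

```bash
# Start a named session on the access server
tmux new -s hpl
# ... work inside the session, detach at any time with CTRL-b d ...
# Re-attach later, e.g. after a dropped SSH connection
tmux attach -t hpl
```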
Theoretical Peak Performance Rpeak
The ULHPC computing nodes on aion or iris feature the following types of processors (see also /etc/motd on the access node):
Cluster | Vendor | Model | #cores | TDP | Freq. | AVX512 Freq. | Nodes | Rpeak | Rpeak |
---|---|---|---|---|---|---|---|---|---|
Aion | AMD | AMD Epyc ROME 7H12 | 64 | 280W | 2.6 GHz | n/a | aion-[0001-0318] | 2.66 TF | 2.13 TF |
Iris | Intel | Xeon E5-2680v4 (broadwell) | 14 | 120W | 2.4 GHz | n/a | iris-[001-108] | 0.538 TF | 0.46 TF |
Iris | Intel | Xeon Gold 6132 (skylake) | 14 | 140W | 2.6 GHz | 2.3 GHz | iris-[109-186,191-196] | 1.03 TF | 0.88 TF |
Iris | Intel | Xeon Platinum 8180M (skylake) | 28 | 205W | 2.5 GHz | 2.3 GHz | iris-[187-190] | 2.06 TF | 1.75 TF |
Computing the theoretical peak performance of these processors is done using the following formula:

    Rpeak = #Cores x [AVX-512 all-core turbo] Frequency x #DP_ops_per_cycle
Knowing that:

- Broadwell processors (`iris-[001-108]` nodes) perform 16 DP ops/cycle and support AVX2/FMA3.
- Skylake processors (`iris-[109-196]` nodes) belong to the Gold or Platinum family and thus have two AVX-512 units, making them capable of performing 32 Double Precision (DP) flops/cycle. From the reference Intel documentation, it is possible to extract for the featured models the AVX-512 turbo frequency (i.e., the maximum all-core frequency in turbo mode), which can be used in place of the base non-AVX core frequency to compute the peak performance (see Fig. 3 p.14).
- AMD Epyc (Rome) processors perform 16 Double Precision (DP) ops/cycle.
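As a quick sanity check, the per-processor Rpeak values of the table above can be recomputed with `bc` from this formula:

```bash
# One Aion AMD Epyc 7H12 processor: 64 cores x 2.6 GHz x 16 DP ops/cycle
echo 'scale=2; 64 * 2.6 * 16 / 1000' | bc    # => 2.66 TFlops
# One Iris Xeon Gold 6132 (skylake): 14 cores x 2.3 GHz (AVX-512 all-core turbo) x 32 DP flops/cycle
echo 'scale=2; 14 * 2.3 * 32 / 1000' | bc    # => 1.03 TFlops
```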
HPL permits measuring the effective Rmax performance (as opposed to the above theoretical peak performance Rpeak). The ratio Rmax/Rpeak corresponds to the HPL efficiency.
Objectives
HPL is a portable implementation of the High-Performance Linpack (HPL) Benchmark for Distributed-Memory Computers. It is used as the reference benchmark to provide data for the Top500 list and thus to rank supercomputers worldwide. HPL relies on an efficient implementation of the Basic Linear Algebra Subprograms (BLAS); you have several choices at this level, and the idea is to compare the different MPI and BLAS implementations.
For the sake of time and simplicity, we will focus on the combination expected to lead to the most performant runs, i.e. the Intel MKL and Intel MPI suite, either in full MPI or in hybrid runs (on 1 or 2 nodes).
As a bonus, a comparison with the reference HPL binary compiled as part of the `toolchain/intel` module will be considered.
Fetching the HPL sources
In the working directory `~/tutorials/HPL`, fetch and uncompress the latest version of the HPL benchmark (i.e. version 2.3 at the time of writing):
$ cd ~/tutorials/HPL
$ mkdir src
# Download the sources
$ cd src
# Download the latest version
$ export HPL_VERSION=2.3
$ wget --no-check-certificate http://www.netlib.org/benchmark/hpl/hpl-${HPL_VERSION}.tar.gz
$ tar xvzf hpl-${HPL_VERSION}.tar.gz
$ cd hpl-${HPL_VERSION}
Alternatively, you can use the following command to fetch and uncompress the HPL sources:
$ cd ~/tutorials/HPL
$ make fetch
$ make uncompress
Building the HPL benchmark
We are first going to use the Intel Cluster Toolkit Compiler Edition, which provides the Intel C/C++ and Fortran compilers as well as Intel MPI.
$ cd ~/tutorials/HPL
# Copy the provided Make.intel64
$ cp ref.d/src/Make.intel64 src/
Now you can reserve an interactive job for the compilation from the access server:
# Quickly get one interactive job for 1h
$ si -N 2 --ntasks-per-node 2
# OR get one interactive job (totalling 2*2 MPI processes) on broadwell-based nodes
$ si -C broadwell -N2 --ntasks-per-node 2
# OR get one interactive job (totalling 2*2 MPI processes) on skylake-based nodes
$ si -C skylake -N2 --ntasks-per-node 2
Now that you are on a computing node, you can load the appropriate module providing the Intel MKL and Intel MPI suite, i.e. `toolchain/intel`:
# Load the appropriate module
$ module load toolchain/intel
$ module list
Intel MKL is now loaded.
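You can also check that the environment variables used later in this tutorial (in `Make.intel64` and for the MKL documentation path) are now defined:

```bash
# Sanity check of the environment exposed by toolchain/intel (exact paths depend on the software set)
echo $EBROOTIMKL     # root of the Intel MKL installation
echo $I_MPI_ROOT     # root of the Intel MPI installation (used for MPdir in Make.intel64)
```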
Read the `INSTALL` file under `src/hpl-2.3`. In particular, you'll have to edit and adapt a new makefile `Make.intel64` (typically inspired by `setup/Make.Linux_Intel64`), which is provided to you on Github for that purpose.
cd src/hpl-2.3
cp ../Make.intel64 .
# OR (if the above command fails)
# cp ~/git/github.com/ULHPC/tutorials/parallel/mpi/HPL/src/Make.intel64 Make.intel64
# Automatically adapt at least the TOPdir variable to the current directory $(pwd),
# thus it SHOULD be run from 'src/hpl-2.3'
sed -i \
-e "s#^[[:space:]]*TOPdir[[:space:]]*=[[:space:]]*.*#TOPdir = $(pwd)#" \
Make.intel64
# Check the difference:
$ diff -ru ../Make.intel64 Make.intel64
--- ../Make.intel64 2019-11-19 23:43:26.668794000 +0100
+++ Make.intel64 2019-11-20 00:33:21.077914972 +0100
@@ -68,7 +68,7 @@
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
-TOPdir = $(HOME)/benchmarks/HPL/src/hpl-2.3
+TOPdir = /home/users/svarrette/tutorials/HPL/src/hpl-2.3
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
In general, to build HPL, you first need to configure the file `Make.intel64` correctly. Take your favorite editor (`vim`, `nano`, etc.) to modify it. In particular, you should:
- Adapt `TOPdir` to point to the directory holding the HPL sources (i.e. where you uncompressed them: `$(HOME)/tutorials/HPL/src/hpl-2.3`) -- this was done using the above `sed` command.
- Adapt the `MP*` variables to point to the appropriate MPI libraries path.
- Correct the OpenMP definitions `OMP_DEFS`.
- (Optionally) adapt the `CCFLAGS` -- in particular, with the Intel compiling suite, you SHOULD at least add `-xHost` to ensure the compilation auto-magically uses the appropriate compilation flags -- see (again) the Intel Math Kernel Library Link Line Advisor.
- (Optionally) adapt the `ARCH` variable.
Here is for instance a suggested difference for Intel MPI:
--- setup/Make.Linux_Intel64 1970-01-01 06:00:00.000000000 +0100
+++ Make.intel64 2019-11-20 00:15:11.938815000 +0100
@@ -61,13 +61,13 @@
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
-ARCH = Linux_Intel64
+ARCH = $(arch)
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
-TOPdir = $(HOME)/hpl
+TOPdir = $(HOME)/tutorials/HPL/src/hpl-2.3
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
@@ -81,9 +81,9 @@
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
-# MPdir = /opt/intel/mpi/4.1.0
-# MPinc = -I$(MPdir)/include64
-# MPlib = $(MPdir)/lib64/libmpi.a
+MPdir = $(I_MPI_ROOT)/intel64
+MPinc = -I$(MPdir)/include
+MPlib = $(MPdir)/lib/libmpi.a
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
@@ -177,9 +178,9 @@
#
CC = mpiicc
CCNOOPT = $(HPL_DEFS)
-OMP_DEFS = -openmp
-CCFLAGS = $(HPL_DEFS) -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk -Wall
-#
+OMP_DEFS = -qopenmp
+CCFLAGS = $(HPL_DEFS) -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk -Wall -xHost
+
#
# On some platforms, it is necessary to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
Once tweaked, run the compilation with:
$> make arch=intel64 clean_arch_all
$> make arch=intel64
If you don't succeed by yourself, use the following Make.intel64.
Once compiled, ensure you are able to run it (you will need at least 4 MPI processes -- for instance with `-N 2 --ntasks-per-node 2`):
$> cd ~/tutorials/HPL/src/hpl-2.3/bin/intel64
$> cat HPL.dat # Default (dummy) HPL.dat input file
# On Slurm cluster, store the output logs into a text file -- see tee
$> srun -n $SLURM_NTASKS ./xhpl | tee test_run.logs
Check the output results with `less test_run.logs`. You can also quickly see the 10 best results obtained by using:
# ================================================================================
# T/V N NB P Q Time Gflops
# --------------------------------------------------------------------------------
$> grep WR test_run.logs | sort -k 7 -n -r | head -n 10
WR00L2L2 29 3 4 1 0.00 9.9834e-03
WR00L2R2 35 2 4 1 0.00 9.9808e-03
WR00R2R4 35 2 4 1 0.00 9.9512e-03
WR00L2C2 30 2 1 4 0.00 9.9436e-03
WR00R2C2 35 2 4 1 0.00 9.9411e-03
WR00R2R2 35 2 4 1 0.00 9.9349e-03
WR00R2R2 30 2 1 4 0.00 9.8879e-03
WR00R2C4 30 2 1 4 0.00 9.8771e-03
WR00C2R2 35 2 4 1 0.00 9.8323e-03
WR00L2C4 29 3 4 1 0.00 9.8049e-03
Alternatively, you can use the build script `scripts/build.HPL` to build the HPL sources on both broadwell and skylake nodes (with the corresponding architectures):
# (if needed) release your previous interactive job to return to the access server
$> exit
$> cd ~/tutorials/HPL
# Create symlink to the scripts directory
$> ln -s ref.d/scripts .
# Create a logs/ directory to store the Slurm logs
$> mkdir logs
# Now submit two building jobs targeting both CPU architectures
$> sbatch -C broadwell ./scripts/build.HPL -n broadwell # Will produce bin/xhpl_broadwell
$> sbatch -C skylake ./scripts/build.HPL -n skylake # Will produce bin/xhpl_skylake
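Once both jobs have completed (check with `squeue -u $USER`), you should find the two architecture-specific binaries announced in the comments above, i.e. `bin/xhpl_broadwell` and `bin/xhpl_skylake`:

```bash
# Check that the build jobs produced the expected binaries
$> ls bin/
# In case of problem, inspect the Slurm logs collected under logs/
$> ls logs/
```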
Preparing batch runs
We are now going to prepare launcher scripts to permit passive runs (typically in the `{default | batch}` queue). We will place them in a separate directory (`runs/`), as it will host the outcomes of the executions on the UL HPC platform.
$> cd ~/tutorials/HPL
$> mkdir -p runs/{broadwell,skylake}/{1N,2N}/{MPI,Hybrid}/ # Prepare the specific run directory
$> cp ref.d/
We are indeed going to run HPL in two different contexts:
- Full MPI, with 1 MPI process per (physical) core reserved.
    - As mentioned in the basic Parallel computations with OpenMP/MPI tutorial, this means that you'll typically reserve the nodes using the `-N <#nodes> --ntasks-per-node 28` options for Slurm, as there are in general 28 cores per node on `iris`.
- Hybrid OpenMP+MPI, with 1 MPI process per CPU socket, and as many OpenMP threads per process as (physical) cores reserved on that socket.
    - As mentioned in the basic Parallel computations with OpenMP/MPI tutorial, this means that you'll typically reserve the nodes using the `-N <#nodes> --ntasks-per-node 2 --ntasks-per-socket 1 -c 14` options for Slurm, as there are in general 2 processors (each with 14 cores) per node on `iris`.
These two contexts will directly affect the values of the HPL parameters `P` and `Q`, since their product should match the total number of MPI processes.
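For instance, the corresponding job submissions on 2 regular iris nodes could look as follows (a sketch assuming the `launcher-HPL.intel.sh` script prepared later in this tutorial, submitted from the `runs/` directory):

```bash
# Full MPI: 2 nodes x 28 cores = 56 MPI processes (1 per physical core)
$> sbatch -N 2 --ntasks-per-node 28 ./launcher-HPL.intel.sh
# Hybrid OpenMP+MPI: 2 nodes x 2 sockets = 4 MPI processes, each with 14 OpenMP threads
$> sbatch -N 2 --ntasks-per-node 2 --ntasks-per-socket 1 -c 14 ./launcher-HPL.intel.sh
```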
HPL main parameters
Running HPL depends on a configuration file `HPL.dat` -- an example is provided in the build directory, i.e. `src/hpl-2.3/bin/intel64/HPL.dat`:
$> cat src/hpl-2.3/bin/intel64/HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
4 # of problems sizes (N)
29 30 34 35 Ns
4 # of NBs
1 2 3 4 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
3 # of process grids (P x Q)
2 1 4 Ps
2 4 1 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
2 4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
See http://www.netlib.org/benchmark/hpl/tuning.html for a description of this file and its parameters (see also the authors' tips).
You can use the following sites for finding the appropriate values:
- Tweak HPL parameters
- HPL Calculator to find good parameters and expected performances
The main parameters to play with for optimizing the HPL runs are:
- `NB`: depends on the CPU architecture; use the recommended blocking sizes (`NB` in `HPL.dat`) listed, after loading the `toolchain/intel` module, under `$EBROOTIMKL/compilers_and_libraries/linux/mkl/benchmarks/mp_linpack/readme.txt`, i.e.
    - `NB=192` for the broadwell processors available on `iris`
    - `NB=384` for the skylake processors available on `iris`
- `P` and `Q`, knowing that the product `P x Q` SHOULD typically be equal to the number of MPI processes.
- Of course `N`, the problem size.
Figure: example of the P-by-Q partitioning of an HPL matrix across 6 processes (2x3 decomposition). (Source)
In order to find out the best performance of your system, the largest problem size fitting in memory is what you should aim for.
Since HPL performs computations on an N x N array of Double Precision (DP) elements, and each double precision element requires `sizeof(double)` = 8 bytes, the memory consumed for a problem size N is 8N^2 bytes.
It follows that `N` can be derived from a simple dimensional analysis based on the volatile memory available to hold these Double Precision elements:

$$ N \simeq \alpha \sqrt{\frac{\text{Total Memory (in bytes)}}{\text{sizeof(double)}}} = \alpha \sqrt{\frac{\#\text{Nodes} \times \text{RAM per node (in bytes)}}{8}} $$

where \alpha is a global ratio normally set to (at least) 80% (best results are typically obtained with \alpha > 92%).
Alternatively, one can target a ratio \beta of the total memory used (for instance 85%), i.e.

$$ N \simeq \sqrt{\beta \times \frac{\text{Total Memory (in bytes)}}{8}} $$

Note that the two ratios you might consider are of course linked: \beta = \alpha^2.
Finally, the problem size should ideally be set to a multiple of the block size `NB`.
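As a back-of-the-envelope check of this formula (a sketch assuming a single node with 128 GiB of RAM, the memory value used for the regular nodes in the `compute_N` examples below) with \alpha = 0.3:

```bash
# N ~ alpha * sqrt(RAM / sizeof(double)) for 1 node with 128 GiB of RAM and alpha = 0.3
echo '0.3 * sqrt(128 * 2^30 / 8)' | bc -l    # => ~39321.6
# Rounded to a multiple of NB, this yields the N values of the table below:
#   205 * 192 = 39360 (broadwell, NB=192)    102 * 384 = 39168 (skylake, NB=384)
```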
Examples of HPL parameters we are going to try (when using regular nodes on the `batch` partition) are proposed in the table below. Note that we purposely use a relatively low value for the ratio \alpha (or \beta), and thus for N, to ensure relatively fast runs within the time of this tutorial.
Architecture | #Node | Mode | MPI proc. | NB | PxQ | \alpha | N |
---|---|---|---|---|---|---|---|
broadwell | 1 | MPI | 28 | 192 | 1x28, 2x14, 4x7 | 0.3 | 39360 |
broadwell | 2 | MPI | 56 | 192 | 1x56, 2x28, 4x14, 7x8 | 0.3 | 55680 |
broadwell | 1 | Hybrid | 2 | 192 | 1x2 | 0.3 | 39360 |
broadwell | 2 | Hybrid | 4 | 192 | 2x2 | 0.3 | 55680 |
skylake | 1 | MPI | 28 | 384 | 1x28, 2x14, 4x7 | 0.3 | 39168 |
skylake | 2 | MPI | 56 | 384 | 1x56, 2x28, 4x14, 7x8 | 0.3 | 55680 |
skylake | 1 | Hybrid | 2 | 384 | 1x2 | 0.3 | 39168 |
skylake | 2 | Hybrid | 4 | 384 | 2x2 | 0.3 | 55680 |
You can use the script `scripts/compute_N` to compute the value of N depending on the global ratio \alpha (using `-r <alpha>`) or \beta (using `-p <beta*100>`):
./scripts/compute_N -h
# 1 Broadwell node, alpha = 0.3
./scripts/compute_N -m 128 -NB 192 -r 0.3 -N 1
# 2 Skylake (regular) nodes, alpha = 0.3
./scripts/compute_N -m 128 -NB 384 -r 0.3 -N 2
# 4 bigmem (skylake) nodes, beta = 0.85
./scripts/compute_N -m 3072 -NB 384 -p 85 -N 4
Using the above values, create the appropriate `HPL.dat` files for each case, under the appropriate directory, i.e. `runs/<arch>/<N>N/`.
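For illustration, here is a sketch generating the file for the 1-node full-MPI broadwell case of the table above (N=39360, NB=192, three 28-process grids 1x28, 2x14 and 4x7), under the assumption that all other tuning lines are kept as in the default `HPL.dat` shown earlier:

```bash
# Hypothetical HPL.dat for runs/broadwell/1N/MPI -- only N, NB, P and Q differ from the default file
cd ~/tutorials/HPL/runs/broadwell/1N/MPI
cat > HPL.dat <<'EOF'
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
39360        Ns
1            # of NBs
192          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
3            # of process grids (P x Q)
1 2 4        Ps
28 14 7      Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
EOF
```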
Slurm launcher (Intel MPI)
Copy and adapt the default MPI SLURM launcher, of which you should have a copy in `~/git/ULHPC/launcher-scripts/slurm/launcher.default.sh`:
$> cd ~/tutorials/HPL/runs
# Prepare a launcher for the Intel suite
$> cp ~/git/github.com/ULHPC/launcher-scripts/slurm/launcher.default.sh launcher-HPL.intel.sh
Take your favorite editor (`vim`, `nano`, etc.) to modify it according to your needs.
Here is for instance a suggested difference for Intel MPI (adapt accordingly):
--- ~/git/ULHPC/launcher-scripts/slurm/launcher.default.sh 2017-06-11 23:40:34.007152000 +0200
+++ launcher-HPL.intel.sh 2017-06-11 23:41:57.597055000 +0200
@@ -10,8 +10,8 @@
#
# Set number of resources
#
-#SBATCH -N 1
+#SBATCH -N 2
#SBATCH --ntasks-per-node=28
### -c, --cpus-per-task=<ncpus>
### (multithreading) Request that ncpus be allocated per process
#SBATCH -c 1
@@ -64,15 +64,15 @@
module load toolchain/intel
# Directory holding your built applications
-APPDIR="$HOME"
+APPDIR="$HOME/tutorials/HPL/src/hpl-2.3/bin/intel64"
# The task to be executed i.E. your favorite Java/C/C++/Ruby/Perl/Python/R/whatever program
# to be invoked in parallel
-TASK="${APPDIR}/app.exe"
+TASK="${APPDIR}/xhpl"
# The command to run
-CMD="${TASK}"
+# CMD="${TASK}"
### General MPI Case:
-# CMD="srun -n $SLURM_NTASKS ${TASK}"
+CMD="srun -n $SLURM_NTASKS ${TASK}"
### OpenMPI case if you wish to specialize the MCA parameters
#CMD="mpirun -np $SLURM_NTASKS --mca btl openib,self,sm ${TASK}"
Now you should create an input `HPL.dat` file within the `runs/<arch>/<N>N/<mode>` directory:
$> cd ~/tutorials/HPL/runs
$> cp ../ref.d/HPL.dat .
$> ll
total 0
-rw-r--r--. 1 svarrette clusterusers 1.5K Jun 12 15:38 HPL.dat
-rwxr-xr-x. 1 svarrette clusterusers 2.7K Jun 12 15:25 launcher-HPL.intel.sh
You are ready for testing a batch job:
$> cd ~/tutorials/HPL/runs
$> sbatch ./launcher-HPL.intel.sh
$> sq # OR (long version) squeue -u $USER
(bonus) Connect to one of the allocated nodes and run `htop` (followed by `u` to select the processes run under your username, and `F5` to enable the tree view).
Now you can check the output of the HPL runs:
$> grep WR slurm-<jobid>.out # /!\ ADAPT <jobid> appropriately.
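As for the interactive test, you can rank the reported Gflops values to spot the best-performing parameter combinations:

```bash
# Keep the 10 best results (Gflops is the 7th field) -- /!\ ADAPT <jobid> appropriately
$> grep WR slurm-<jobid>.out | sort -k 7 -n -r | head -n 10
```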
Of course, this was only a small test: optimizing the HPL parameters to get the best performance and efficiency out of a given HPC platform is not easy.
Below are some plots obtained when benchmarking the `iris` cluster and seeking the best set of parameters across an increasing number of nodes (see this blog post).