Scalable Science and Parallel Computations with OpenMP/MPI
Copyright (c) 2013-2021 UL HPC Team <hpc-sysadmins@uni.lu>
When granted access to the UL HPC platform you will have at your disposal parallel computing resources.
Thus you will be able to run:
- ideally, parallel (OpenMP, MPI, CUDA, OpenCL...) jobs;
- however, if your workflow involves serial tasks/jobs, you must run them efficiently.
The objective of this tutorial is to show you how to run your OpenMP and/or [hybrid] MPI applications on top of the UL HPC platform.
For all the executions we are going to perform in this tutorial, you probably want to monitor the parallel execution on one of the allocated nodes. To do that, and assuming you have reserved computing resources or have a passive job running (see below):
- open another terminal (or another `tmux`/`screen` window -- see the cheat sheet below) as you'll want to monitor the execution.
- Connect to the allocated node with `sjoin <jobid> <nodename>`
- In this new terminal/window, run `htop`:
    - press 'u' to filter by process owner, and select your login
    - press 'F5' to enable the tree view
Note that there are other more advanced ways and tools to monitor parallel executions over OpenMP/MPI that are covered in another tutorial.
Finally, and this is especially true for hybrid OpenMP/MPI code, remember that you should always align resource specs with physical NUMA characteristics.
- Ex (Aion): 16 cores per socket, 8 sockets ("physical" CPUs) per node (128c/node)
    - `[-N <N>] --ntasks-per-node <8n> --ntasks-per-socket <n> -c <thread>`
    - Total: `<N>` × 8 × `<n>` tasks, each on `<thread>` threads. Ensure `<n>` × `<thread>` = 16 on Aion if you target full node utilisation.
    - Ex: `-N 2 --ntasks-per-node 32 --ntasks-per-socket 4 -c 4` (Total: 64 tasks, each on 4 threads)
- Ex (Iris): 14 cores per socket, 2 sockets ("physical" CPUs) per node (28c/node)
    - `[-N <N>] --ntasks-per-node <2n> --ntasks-per-socket <n> -c <thread>`
    - Total: `<N>` × 2 × `<n>` tasks, each on `<thread>` threads. Ensure `<n>` × `<thread>` = 14 on Iris if you target full node utilisation.
    - Ex: `-N 2 --ntasks-per-node 4 --ntasks-per-socket 2 -c 7` (Total: 8 tasks, each on 7 threads)
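To make the arithmetic concrete, here is a minimal (hypothetical) batch header targeting 2 full Aion nodes with one MPI task per virtual socket; the resource values are illustrative only and should be adapted to your application:

#!/bin/bash -l
# Hypothetical full-node hybrid allocation on Aion (8 sockets/node x 16 cores/socket)
#SBATCH -N 2                     # <N> = 2 nodes
#SBATCH --ntasks-per-node 8      # <8n> = 8, i.e. n = 1 task per (virtual) socket
#SBATCH --ntasks-per-socket 1    # <n> = 1
#SBATCH -c 16                    # <thread> = 16, so <n> x <thread> = 1 x 16 = 16 cores/socket
# Total: 2 x 8 x 1 = 16 tasks, each on 16 threads, i.e. 256 cores = 2 full Aion nodes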
Pre-requisites
Ensure you are able to connect to the UL HPC clusters.
In particular, recall that the `module` command is not available on the access frontends. For all tests and compilations, you MUST work on a compute node.
Now you'll need to pull the latest changes into your working copy of the ULHPC/tutorials repository, which you should have cloned in `~/git/github.com/ULHPC/tutorials` (see the "preliminaries" tutorial):
(access)$> cd ~/git/github.com/ULHPC/tutorials
(access)$> git pull
Now configure a dedicated directory `~/tutorials/OpenMP-MPI` for this session:
# return to your home
(access)$> mkdir -p ~/tutorials/OpenMP-MPI/bin
(access)$> cd ~/tutorials/OpenMP-MPI
# create a symbolic link to the root reference parallel tutorial material
(access)$> ln -s ~/git/github.com/ULHPC/tutorials/parallel ref.d
(access)$> ln -s ref.d/basics .    # Basics instructions
(access)$> cd basics
Advanced users (optional yet strongly recommended): create a Tmux session (see the Tmux cheat sheet and tutorial) or a GNU Screen session that you can recover later. See also the "Getting Started" tutorial.
# /!\ Advanced (but recommended) best-practice:
# Always work within a TMux or GNU Screen session named '<topic>' (Adapt accordingly)
(access-aion)$> tmux new -s HPC-school # Tmux
(access-iris)$> screen -S HPC-school # GNU Screen
# TMux | GNU Screen | Action
# ----------|------------|----------------------------------------------
# CTRL+b c  | CTRL+a c   | (create) create a new window. The default window number is zero.
# CTRL+b n  | CTRL+a n   | (next) switch to the next window.
# CTRL+b p  | CTRL+a p   | (prev) switch to the previous window.
# CTRL+b ,  | CTRL+a A   | (title) rename the current window
# CTRL+b d  | CTRL+a d   | (detach) detach from the current session
# Once detached:
#   tmux ls  | screen -ls : list the available sessions
#   tmux att | screen -x  : reattach to a past session
Parallel OpenMP Jobs
OpenMP (Open Multi-Processing) is a popular parallel programming model for multi-threaded applications. More precisely, it is an Application Programming Interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran on most platforms, instruction set architectures and operating systems.
- Reference website: https://www.openmp.org/
- Latest version: 5.2 (Nov 2021) -- specifications
- Below notes are adapted from LLNL OpenMP tutorial
OpenMP is designed for multi-processor/multi-core, shared-memory machines (nowadays NUMA). OpenMP programs accomplish parallelism exclusively through the use of threads.
- A thread of execution is the smallest unit of processing that can be scheduled by an operating system.
- Threads exist within the resources of a single process. Without the process, they cease to exist.
- Typically, the number of threads matches the number of machine processors/cores -- see Resource Allocation.
    - Reminder: aion compute nodes MUST be seen as 8 (virtual) processors of 16 cores each, even if physically the nodes host 2 sockets of AMD Epyc ROME 7H12 processors with 64 cores each (total: 128 cores per node). iris compute nodes typically host 2 physical processors of 14 cores each (total: 28 cores per node); the exceptions are the bigmem nodes (4 physical processors of 28 cores each, total: 112 cores per node).
- However, the actual use of threads is up to the application.
- `OMP_NUM_THREADS` (if present) initially specifies the maximum number of threads;
    - you can use `omp_set_num_threads()` to override the value of `OMP_NUM_THREADS`;
    - the presence of a `num_threads` clause overrides both other values.
- OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.
    - parallelization can be as simple as taking a serial program and inserting compiler directives...
    - in general, it is much more complex than that.
- OpenMP uses the fork-join model of parallel execution:
    - FORK: the master thread creates a team of parallel threads.
    - The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads.
    - JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.
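To illustrate the fork-join model and the thread-count precedence rules above, here is a minimal sketch (not part of the provided sources); compile it with `gcc -fopenmp` (foss) or `icc -qopenmp` (intel):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);        /* overrides OMP_NUM_THREADS (if set) */

    /* FORK: the master thread creates a team of threads; the num_threads
     * clause overrides both OMP_NUM_THREADS and omp_set_num_threads(). */
    #pragma omp parallel num_threads(2)
    {
        printf("Hello from thread %d/%d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   /* JOIN: the team synchronizes and terminates; only the master remains */
    return 0;
}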
Slurm reservations for OpenMP programs
- (optional, as this is the default, but it is recommended to always specify it so you know what you are doing) set a single task per node with `--ntasks-per-node=1`
- Use `-c <N>` (or `--cpus-per-task <N>`) to set the number of OpenMP threads you wish to use.
- (again) The number of threads should not exceed the number of cores on a compute node.
Thus a minimal Slurm launcher for OpenMP would typically look like the following -- see also our default Slurm launchers.
#!/bin/bash -l
#SBATCH --ntasks-per-node=1 # Run a single task per node, more explicit than '-n 1'
#SBATCH -c 28 # on iris: number of CPU cores i.e. OpenMP threads per task
###SBATCH -c 128 # on aion (remove first '##' and top line)
#SBATCH --time=0-01:00:00
#SBATCH -p batch
print_error_and_exit() { echo "***ERROR*** $*"; exit 1; }
module purge || print_error_and_exit "No 'module' command"
module load toolchain/foss # or toolchain/intel
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
OPTS=$*
srun /path/to/your/threaded.app ${OPTS}
OpenMP Compilation
| Toolchain         | Compilation command (C) | Compilation command (C++) |
|-------------------|-------------------------|---------------------------|
| `toolchain/intel` | `icc -qopenmp [...]`    | `icpc -qopenmp [...]`     |
| `toolchain/foss`  | `gcc -fopenmp [...]`    | `g++ -fopenmp [...]`      |
Hands-on: OpenMP Helloworld and matrix multiplication
You can find in src/hello_openmp.c
the traditional OpenMP "Helloworld" example.
- Reserve an interactive job to launch 4 OpenMP threads (for 30 minutes)
(access)$> si -c 4 -t 0:30:00
$> export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
- Check the set variable `$OMP_NUM_THREADS`. Which value do you expect?

  $> echo $OMP_NUM_THREADS

- Check and compile the source `src/hello_openmp.c` to generate:
    - `bin/${ULHPC_CLUSTER}_hello_openmp` (compiled with the `foss` toolchain)
    - `bin/intel_${ULHPC_CLUSTER}_hello_openmp` (compiled with the `intel` toolchain)
    - indeed, it's always good practice to indicate the supercomputer used to generate the binary: use the environment variable `${ULHPC_CLUSTER}` for that.
$> cat src/hello_openmp.c
######### foss toolchain
$> module purge # Safeguard
$> module load toolchain/foss
$> gcc -fopenmp -Wall -O2 src/hello_openmp.c -o bin/${ULHPC_CLUSTER}_hello_openmp
######### intel toolchain
$> module purge # Safeguard
$> module load toolchain/intel
$> icc -qopenmp -xhost -Wall -O2 src/hello_openmp.c -o bin/intel_${ULHPC_CLUSTER}_hello_openmp
- (only if you have trouble compiling): `make omp`
- Execute the generated binaries multiple times. What do you notice?
- Exit your interactive session (`exit` or `CTRL-D`)
- Prepare a launcher script (use your favorite editor) to execute this application in batch mode -- adapt the pthreads/OpenMP template launcher:

  $> sbatch ./launcher.OpenMP.sh
Repeat the above procedure on a more serious computation: a naive matrix multiplication using OpenMP, whose source code is located in `src/matrix_mult_openmp.c`.
Adapt the launcher script to sustain both executions (OpenMP helloworld and matrix multiplication).
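For reference, a naive OpenMP matrix multiplication typically parallelizes the outer loop so that the rows of the result are shared among the threads. The sketch below is only an illustration (with a hypothetical fixed dimension N); it is not necessarily identical to the provided src/matrix_mult_openmp.c:

#include <stdio.h>
#include <omp.h>

#define N 1024                        /* hypothetical matrix dimension */
static double A[N][N], B[N][N], C[N][N];

int main(void) {
    /* initialize A and B with arbitrary values */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    double t0 = omp_get_wtime();
    /* the iterations over the rows of C are divided among the OpenMP threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    printf("C[0][0] = %g, elapsed = %gs\n", C[0][0], omp_get_wtime() - t0);
    return 0;
}

Comparing the elapsed time for different numbers of threads is precisely the point of the launcher exercise above.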
Note: if you are lazy (or late), you can use the provided launcher script scripts/launcher.OpenMP.sh
.
$> ./scripts/launcher.OpenMP.sh -h
NAME
launcher.OpenMP.sh: Generic OpenMP launcher
Default APPDIR: /home/users/svarrette/tutorials/OpenMP-MPI/basics/bin
Default APP: aion_hello_openmp
Take the good habit to prefix the intel binaries (as foss toolchain is assumed by default)
with 'intel_'
USAGE
[sbatch] ./scripts/launcher.OpenMP.sh [-n] {intel | foss } [app]
EXE=/path/to/multithreadedapp.exe [sbatch] ./scripts/launcher.OpenMP.sh [-n] {intel | foss }
OPTIONS:
-n --dry-run: Dry run mode
Example:
[sbatch] ./scripts/launcher.OpenMP.sh # run FOSS build <cluster>_hello_openmp
[sbatch] ./scripts/launcher.OpenMP.sh intel # run intel build intel_<cluster>_hello_openmp
[sbatch] ./scripts/launcher.OpenMP.sh foss matrix_mult_openmp # run FOSS build matrix_mult_openmp
EXE=/home/users/svarrette/bin/datarace [sbatch] ./scripts/launcher.OpenMP.sh intel # run intel build ~/bin/datarace
Now you can execute it:
$ ./scripts/launcher.OpenMP.sh
$ ./scripts/launcher.OpenMP.sh intel
$ ./scripts/launcher.OpenMP.sh foss ${ULHPC_CLUSTER}_matrix_mult_openmp
Passive jobs examples on aion:
$> sbatch -c 128 ./scripts/launcher.OpenMP.sh foss ${ULHPC_CLUSTER}_matrix_mult_openmp
$> sbatch -c 128 ./scripts/launcher.OpenMP.sh intel ${ULHPC_CLUSTER}_matrix_mult_openmp
Check the elapsed time: what do you notice?
(optional) Hands-on: OpenMP data race benchmark suite
One way to test most OpenMP features is to evaluate their execution against a benchmark. For instance, we are going to test the OpenMP installation against DataRaceBench, a benchmark suite designed to systematically and quantitatively evaluate the effectiveness of data race detection tools. It includes a set of microbenchmarks with and without data races. Parallelism is represented by OpenMP directives.
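As a reminder of what such tools detect, the sketch below (illustrative only, not taken from DataRaceBench) contains a classic data race on a shared accumulator, followed by the race-free variant using a reduction clause:

#include <stdio.h>

int main(void) {
    double sum_racy = 0.0, sum_ok = 0.0;

    /* DATA RACE: all threads update 'sum_racy' concurrently without synchronization */
    #pragma omp parallel for
    for (int i = 1; i <= 1000; i++)
        sum_racy += i;

    /* Race-free: each thread accumulates privately; partial sums are combined at the end */
    #pragma omp parallel for reduction(+:sum_ok)
    for (int i = 1; i <= 1000; i++)
        sum_ok += i;

    printf("racy = %g (non-deterministic), correct = %g (expected 500500)\n",
           sum_racy, sum_ok);
    return 0;
}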
$> cd ~/git/github.com/ULHPC/tutorials/parallel/basics
$> make fetch # clone src/dataracebench
$> cd src/dataracebench
Now you can reserve the nodes and set `OMP_NUM_THREADS`:
- Reserve an interactive job to launch a maximum of OpenMP threads on 1 node (for 1 hour)
# Example on Aion, 128 cores per node
(access)$> si --ntasks-per-node=1 -c 128 -t 1:00:00
$> export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
- Open another terminal (or another `tmux`/`screen` window) to monitor the execution (see the instructions at the top).
- Execute the benchmark, for instance using the `intel` toolchain:
$> module load toolchain/intel
$> ./check-data-races.sh --help
Usage: ./check-data-races.sh [--run] [--help] language
--help : this option
--small : compile and test all benchmarks using small parameters with Helgrind, ThreadSanitizer, Archer, Intel inspector.
--run : compile and run all benchmarks with gcc (no evaluation)
--run-intel : compile and run all benchmarks with Intel compilers (no evaluation)
--helgrind : compile and test all benchmarks with Helgrind
--tsan-clang: compile and test all benchmarks with clang ThreadSanitizer
--tsan-gcc : compile and test all benchmarks with gcc ThreadSanitizer
--archer : compile and test all benchmarks with Archer
--coderrect : compile and test all benchmarks with Coderrect Scanner
--inspector : compile and test all benchmarks with Intel Inspector
--romp : compile and test all benchmarks with Romp
--llov : compile and test all benchmarks with LLVM OpenMP Verifier (LLOVE)
--customize : compile and test customized test list and tools
$> ./check-data-races.sh --run-intel C
Useful OpenMP links:
- https://www.openmp.org/
- OpenMP Tutorial LLNL
Parallel/Distributed MPI Jobs
The Message Passing Interface (MPI) Standard is a message passing library standard based on the consensus of the MPI Forum. The goal of the Message Passing Interface is to establish a portable, efficient, and flexible standard for message passing that will be widely used for writing message passing programs. MPI is not an IEEE or ISO standard, but has in fact, become the "industry standard" for writing message passing programs on HPC platforms.
- Reference website: https://www.mpi-forum.org/
- Latest version: 4.0 (June 2021) -- specifications
- Below notes are adapted from LLNL MPI tutorial
In the MPI programming model, a computation comprises one or more processes that communicate by calling library routines to send and receive messages to other processes. In most MPI implementations, a fixed set of processes is created at program initialization, and one process is created per processor.
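For illustration, a minimal MPI program follows the pattern below (a sketch in the spirit of the src/hello_mpi.c used later in this tutorial, not necessarily identical to it):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, len;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* the fixed set of processes is created here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* who am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many of us? */
    MPI_Get_processor_name(hostname, &len);

    printf("Hello from rank %d/%d running on %s\n", rank, size, hostname);
    MPI_Finalize();
    return 0;
}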
MPI implementations
The UL HPC platform offers several MPI implementations:

| MPI Suite | `module load ...`  | Compiler (C)   | Compiler (C++)  |
|-----------|--------------------|----------------|-----------------|
| Intel MPI | `toolchain/intel`  | `mpiicc [...]` | `mpiicpc [...]` |
| OpenMPI   | `mpi/OpenMPI`      | `mpicc [...]`  | `mpic++ [...]`  |
MPI compilation
| MPI Suite | Example compilation command                            |
|-----------|--------------------------------------------------------|
| Intel MPI | `{mpiicc/mpiicpc} -Wall [-qopenmp] [-xhost] -O2 [...]` |
| OpenMPI   | `{mpicc/mpic++} -Wall [-fopenmp] -O2 [...]`            |
Of course, it is possible to have hybrid code, mixing MPI and OpenMP primitives.
Slurm reservations and usage for MPI programs
- set the number of distributed nodes you want to reserve with `-N <N>`
- set the number of MPI processes per node (that's more explicit) with `--ntasks-per-node=<N>`
    - you can also use `-n <N>` to specify the total number of MPI processes you want, but the above approach is advised.
- (optional, as this is the default) set a single thread per MPI process with `-c 1`
    - except when running a hybrid code...

Important:
- To run your MPI program, be aware that Slurm is able to directly launch MPI tasks and initialize MPI communications via the Process Management Interface (PMI) and PMIx
    - this permits the task affinity to be resolved by the scheduler (avoiding the use of `mpirun --map-by [...]`)
- Simply use (whatever MPI flavor you use):

  srun -n $SLURM_NTASKS /path/to/mpiprog [...]
Thus a minimal launcher for OpenMPI would typically look like the following -- see the MPI template Launcher
#!/bin/bash -l
#SBATCH -N 2
#SBATCH --ntasks-per-node 128 # MPI processes per node - use 28 on iris
#SBATCH -c 1
#SBATCH --time=0-01:00:00
#SBATCH -p batch
print_error_and_exit() { echo "***ERROR*** $*"; exit 1; }
module purge || print_error_and_exit "No 'module' command"
module load toolchain/foss
module load mpi/OpenMPI
OPTS=$*
srun -n $SLURM_NTASKS /path/to/your/openmpi.app ${OPTS}
In the above example, 2x128 = 256 MPI processes will be launched (matches Aion configuration). You will have to adapt it for running on Iris.
Hands-on: MPI Helloworld and matrix multiplication
You can find in src/hello_mpi.c
the traditional MPI "Helloworld" example.
- Reserve an interactive job to launch 6 MPI processes across two nodes, i.e. 2x3 (for 30 minutes)

  (access)$> si -N 2 --ntasks-per-node=3 -t 0:30:00

- Check and compile the source `src/hello_mpi.c` to generate:
    - `bin/openmpi_${ULHPC_CLUSTER}_hello_mpi` (compiled with the `mpi/OpenMPI` module)
    - `bin/intel_${ULHPC_CLUSTER}_hello_mpi` (compiled with the `intel` toolchain and Intel MPI)
$> cat src/hello_mpi.c
######### OpenMPI
$> module purge # Safeguard
$> module load mpi/OpenMPI
$> mpicc -Wall -O2 src/hello_mpi.c -o bin/openmpi_${ULHPC_CLUSTER}_hello_mpi
######### Intel MPI
$> module purge # Safeguard
$> module load toolchain/intel
$> mpiicc -Wall -xhost -O2 src/hello_mpi.c -o bin/intel_${ULHPC_CLUSTER}_hello_mpi
- (only if you have trouble compiling): `make mpi`
- Execute the generated binaries multiple times. What do you notice?
- Exit your interactive session (`exit` or `CTRL-D`)
- Prepare a launcher script (use your favorite editor) to execute this application in batch mode -- adapt the MPI template launcher:

  $> sbatch ./launcher.MPI.sh
Repeat the above procedure on a more serious computation: a naive matrix multiplication using MPI, whose source code is located in `src/matrix_mult_mpi.c`.
Adapt the launcher script to sustain both executions (MPI helloworld and matrix multiplication).
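For reference, a naive MPI matrix multiplication typically broadcasts one matrix and distributes blocks of rows of the other across the ranks. The sketch below is only an illustration (hypothetical fixed dimension N, assumed divisible by the number of ranks); it is not necessarily identical to the provided src/matrix_mult_mpi.c:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 512                 /* hypothetical dimension; assumed divisible by the number of ranks */

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                         /* block of rows owned by each rank */
    double *A = NULL, *C = NULL;
    double *B     = malloc(N * N    * sizeof(double));
    double *A_loc = malloc(rows * N * sizeof(double));
    double *C_loc = malloc(rows * N * sizeof(double));

    if (rank == 0) {                             /* rank 0 initializes the full matrices */
        A = malloc(N * N * sizeof(double));
        C = malloc(N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    /* distribute the data: B to everyone, a block of rows of A to each rank */
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(A, rows * N, MPI_DOUBLE, A_loc, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* local computation of the owned rows of C */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A_loc[i * N + k] * B[k * N + j];
            C_loc[i * N + j] = sum;
        }

    /* gather the result back on rank 0 */
    MPI_Gather(C_loc, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("C[0] = %g (expected %g)\n", C[0], 2.0 * N);

    free(B); free(A_loc); free(C_loc);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}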
Note: if you are lazy (or late), you can use the provided launcher script scripts/launcher.MPI.sh
.
$ ./scripts/launcher.MPI.sh -h
NAME
launcher.MPI.sh: Generic MPI launcher
Default APPDIR: /Users/svarrette/tutorials/OpenMP-MPI/basics/bin
Default APP: _hello_mpi
Take the good habit to prefix the binary to execute with MPI suit used for
the build. Here the default MPI application run would be
EXE=/Users/svarrette/tutorials/OpenMP-MPI/basics/bin/openmpi__hello_mpi
which will be run as srun -n $SLURM_NTASKS [...]
USAGE
[sbatch] ./scripts/launcher.MPI.sh [-n] {intel | openmpi | mvapich2} [app]
EXE=/path/to/mpiapp.exe [sbatch] ./scripts/launcher.MPI.sh [-n] {intel | openmpi | mvapich2}
OPTIONS:
-n --dry-run: Dry run mode
Example:
[sbatch] ./scripts/launcher.MPI.sh # run OpenMPI build openmpi_<cluster>_hello_mpi
[sbatch] ./scripts/launcher.MPI.sh intel # run Intel MPI build intel_<cluster>_hello_mpi
[sbatch] ./scripts/launcher.MPI.sh openmpi matrix_mult_mpi # run OpenMPI build openmpi_matrix_mult_mpi
EXE=/Users/svarrette/bin/xhpl [sbatch] ./scripts/launcher.MPI.sh intel # run intel build ~/bin/xhpl
Now you can execute it:
$ ./scripts/launcher.MPI.sh
$ ./scripts/launcher.MPI.sh intel
$ ./scripts/launcher.MPI.sh ${ULHPC_CLUSTER}_matrix_mult_mpi
$ ./scripts/launcher.MPI.sh intel ${ULHPC_CLUSTER}_matrix_mult_mpi
Passive jobs examples:
$ sbatch --ntasks-per-node 128 ./scripts/launcher.MPI.sh openmpi
$ sbatch --ntasks-per-node 128 ./scripts/launcher.MPI.sh intel ${ULHPC_CLUSTER}_matrix_mult_mpi
Check the elapsed time: what do you notice?
Useful MPI links:
- http://www.mpi-forum.org/docs/
- MPI Tutorial LLNL
- Intel MPI:
    - Step by Step Performance Optimization with Intel(R) C++ Compiler
    - Intel(R) C++ Compiler Developer Guide and Reference
Hybrid OpenMP+MPI Programs
Of course, you can have hybrid code mixing MPI and OpenMP primitives.
- You need to compile the code with the `-qopenmp` (Intel MPI) or `-fopenmp` (other MPI suites) flag
- You need to adapt the `OMP_NUM_THREADS` environment variable accordingly
    - you need to adapt the value of `-c <N>` (or `--cpus-per-task <N>`) to set the number of OpenMP threads you wish to use per MPI process
    - try to inherit it from the Slurm allocation (and provide a meaningful default value):

      export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

- For best performance, you MUST align resource specs with physical NUMA characteristics -- see the ULHPC Technical documentation on Slurm Resource Allocation
    - aion compute nodes MUST be seen as 8 (virtual) processors of 16 cores each, even if physically the nodes host 2 sockets of AMD Epyc ROME 7H12 processors with 64 cores each (total: 128 cores per node).
    - iris compute nodes typically host 2 physical processors of 14 cores each (total: 28 cores per node); the exceptions are the bigmem nodes (4 physical processors of 28 cores each, total: 112 cores per node).
Other misc considerations:
- You need to ensure the environment variable `OMP_NUM_THREADS` is shared across the nodes
- (Intel MPI only) you probably want to set `I_MPI_PIN_DOMAIN=omp`
- Like any MPI execution, simply use (whatever MPI flavor you use):

  srun -n $SLURM_NTASKS /path/to/hybrid [...]
Thus a minimal launcher for hybrid OpenMP/MPI would typically look like the following -- see the Hybrid OpenMP+MPI template Launcher
#!/bin/bash -l
# Multi-node hybrid OpenMP+MPI application launcher
#SBATCH -N 2
#SBATCH --ntasks-per-node 8 # MPI processes per node - use 2 on iris
#SBATCH --ntasks-per-socket 1 # MPI processes per [virtual] processor
#SBATCH -c 16 # OpenMP threads per MPI process - use 14 on iris
#SBATCH --time=0-01:00:00
#SBATCH -p batch
print_error_and_exit() { echo "***ERROR*** $*"; exit 1; }
module purge || print_error_and_exit "No 'module' command"
module load mpi/OpenMPI # or toolchain/intel
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
OPTS=$*
srun -n $SLURM_NTASKS /path/to/your/parallel-hybrid-app ${OPTS}
In the above example, 2x8 = 16 MPI processes will be launched, each with 16 OpenMP threads (to match Aion configuration). You will have to adapt it for running on Iris.
Hands-on: Hybrid OpenMP+MPI Helloworld
You can find in src/hello_hybrid.c
the traditional OpenMP+MPI "Helloworld" example.
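The structure of such a hybrid program is typically as follows (a minimal sketch, not necessarily identical to the provided src/hello_hybrid.c): each MPI rank initializes MPI with the desired thread support level, then opens an OpenMP parallel region.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int provided, rank, size;

    /* request MPI_THREAD_FUNNELED: only the master thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each MPI process forks OMP_NUM_THREADS (= SLURM_CPUS_PER_TASK) threads */
    #pragma omp parallel
    printf("Hello from thread %d/%d of MPI rank %d/%d\n",
           omp_get_thread_num(), omp_get_num_threads(), rank, size);

    MPI_Finalize();
    return 0;
}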
- Reserve an interactive job to launch 2 MPI processes (1 per node), each composed of 4 OpenMP threads (for 30 minutes)

  (access)$ si -N 2 --ntasks-per-node=1 -c 4 -t 0:30:00
  $ export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

- Check the set variable `$OMP_NUM_THREADS`. Which value do you expect?

  $> echo $OMP_NUM_THREADS

- Check and compile the source `src/hello_hybrid.c` to generate:
    - `bin/openmpi_${ULHPC_CLUSTER}_hello_hybrid` (compiled with the `mpi/OpenMPI` module)
    - `bin/intel_${ULHPC_CLUSTER}_hello_hybrid` (compiled with the `intel` toolchain and Intel MPI)
$> cat src/hello_hybrid.c
######### OpenMPI
$> module purge # Safeguard
$> module load mpi/OpenMPI
$> mpicc -fopenmp -Wall -O2 src/hello_hybrid.c -o bin/openmpi_${ULHPC_CLUSTER}_hello_hybrid
######### Intel MPI
$> module purge # Safeguard
$> module load toolchain/intel
$> mpiicc -qopenmp -Wall -xhost -O2 src/hello_hybrid.c -o bin/intel_${ULHPC_CLUSTER}_hello_hybrid
- (only if you have trouble compiling): `make hybrid`
- Execute the generated binaries (see the above tips)
- Exit your interactive session (`exit` or `CTRL-D`)
- Adapt the MPI launcher to allow batch job submissions of hybrid programs:

  $> sbatch ./launcher.hybrid.sh

Note: if you are lazy (or late), you can use the provided launcher script `scripts/launcher.hybrid.sh`.
$ ./scripts/launcher.hybrid.sh -h
NAME
launcher.hybrid.sh: Generic Hybrid OpenMP+MPI launcher
Default APPDIR: /Users/svarrette/tutorials/OpenMP-MPI/basics/bin
Default APP: _hello_hybrid
Take the good habit to prefix the binary to execute with MPI suit used for
the build. Here the default Hybrid OpenMP+MPI application run would be
EXE=/Users/svarrette/tutorials/OpenMP-MPI/basics/bin/openmpi__hello_hybrid
which will be run as srun -n $SLURM_NTASKS [...]
USAGE
[sbatch] ./scripts/launcher.hybrid.sh [-n] {intel | openmpi | mvapich2} [app]
EXE=/path/to/hydridapp.exe [sbatch] ./scripts/launcher.hybrid.sh [-n] {intel | openmpi | mvapich2}
OPTIONS:
-n --dry-run: Dry run mode
Example:
[sbatch] ./scripts/launcher.hybrid.sh # run hybrid OpenMPI build openmpi_<cluster>_hello_hybrid
[sbatch] ./scripts/launcher.hybrid.sh intel # run hybrid Intel MPI build intel_<cluster>_hello_hybrid
[sbatch] ./scripts/launcher.hybrid.sh openmpi matrix_mult # run hybrid OpenMPI build openmpi_matrix_mult
EXE=/Users/svarrette/bin/hpcg [sbatch] ./scripts/launcher.hybrid.sh intel # run hybrid intel build ~/bin/hpcg
Now you can execute it:
$ ./scripts/launcher.hybrid.sh
$ ./scripts/launcher.hybrid.sh intel
Passive jobs examples:
# On Aion, you need to adapt the default settings
$> sbatch --ntasks-per-node 8 -c 16 ./scripts/launcher.hybrid.sh
$> sbatch --ntasks-per-node 8 -c 16 ./scripts/launcher.hybrid.sh intel
Code optimization tips for your OpenMP and/or MPI programs
- Consider changing your memory allocation functions to avoid fragmentation and enable scalable concurrency support (this applies to both OpenMP and MPI programs) -- see the run-time sketch below.
- When using the `intel` toolchain:
    - see the Step by Step Performance Optimization with Intel(R) C++ Compiler
    - the `-xhost` option enables processor-specific optimizations.
    - you might wish to consider the Interprocedural Optimization (IPO) approach, an automatic, multi-step process that allows the compiler to analyze your code and determine where you can benefit from specific optimizations.
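As a purely illustrative sketch of the memory-allocation tip above (the library path below is a placeholder -- adapt it to a scalable allocator such as jemalloc or Intel TBB's tbbmalloc proxy that is actually installed on the cluster), the default `malloc` can often be replaced at run time without touching the code:

# Hypothetical run-time substitution of the default allocator (adapt the path)
export LD_PRELOAD=/path/to/libjemalloc.so      # or /path/to/libtbbmalloc_proxy.so
srun -n ${SLURM_NTASKS} /path/to/your/parallel-app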
Troubleshooting
srun: error: PMK_KVS_Barrier duplicate request from task ...

- If you are trying to use `mpirun` (instead of `srun`) from Intel MPI within a Slurm session and receive such an error on `mpirun`: make sure `$I_MPI_PMI_LIBRARY` is not set (`unset I_MPI_PMI_LIBRARY`).