Scalable Science and Parallel Computations with OpenMP/MPI
Copyright (c) 2013-2021 UL HPC Team <hpc-sysadmins@uni.lu>
When granted access to the UL HPC platform you will have at your disposal parallel computing resources.
Thus you will be able to run:
- ideally, parallel (OpenMP, MPI, CUDA, OpenCL...) jobs;
- however, if your workflow involves serial tasks/jobs, you must run them efficiently.
The objective of this tutorial is to show you how to run your OpenMP and/or [hybrid] MPI applications on top of the UL HPC platform.
For all the executions we are going to perform in this tutorial, you probably want to monitor the parallel execution on one of the allocated nodes. To do that, and assuming you have reserved computing resources or have a passive job running (see below):
- open another terminal (or another `tmux`/`screen` window -- see the cheat sheet below) as you'll want to monitor the execution.
- Connect to the allocated node with `sjoin <jobid> <nodename>`
- In this new terminal/window, run `htop`:
    - press 'u' to filter by process owner, and select your login
    - press 'F5' to enable the tree view
Note that there are other more advanced ways and tools to monitor parallel executions over OpenMP/MPI that are covered in another tutorial.
Finally, and this is especially true for hybrid OpenMP/MPI code, remember that you should always align resource specs with physical NUMA characteristics.
- Ex (Aion): 16 cores per socket, 8 sockets ("physical" CPUs) per node (128c/node)
    - `[-N <N>] --ntasks-per-node <8n> --ntasks-per-socket <n> -c <thread>`
    - Total: `<N>` × 8 × `<n>` tasks, each on `<thread>` threads. Ensure `<n>` × `<thread>` = 16 on Aion if you target full node utilisation.
    - Ex: `-N 2 --ntasks-per-node 32 --ntasks-per-socket 4 -c 4` (Total: 64 tasks, each on 4 threads)
- Ex (Iris): 14 cores per socket, 2 sockets ("physical" CPUs) per node (28c/node)
    - `[-N <N>] --ntasks-per-node <2n> --ntasks-per-socket <n> -c <thread>`
    - Total: `<N>` × 2 × `<n>` tasks, each on `<thread>` threads. Ensure `<n>` × `<thread>` = 14 on Iris if you target full node utilisation.
    - Ex: `-N 2 --ntasks-per-node 4 --ntasks-per-socket 2 -c 7` (Total: 8 tasks, each on 7 threads)
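To make the arithmetic concrete, here is a minimal (hypothetical) batch header targeting 2 full Aion nodes with one MPI task per virtual socket; the resource values are illustrative only and should be adapted to your application:

#!/bin/bash -l
# Hypothetical full-node hybrid allocation on Aion (8 sockets/node x 16 cores/socket)
#SBATCH -N 2                     # <N> = 2 nodes
#SBATCH --ntasks-per-node 8      # <8n> = 8, i.e. n = 1 task per (virtual) socket
#SBATCH --ntasks-per-socket 1    # <n> = 1
#SBATCH -c 16                    # <thread> = 16, so <n> x <thread> = 1 x 16 = 16 cores/socket
# Total: 2 x 8 x 1 = 16 tasks, each on 16 threads, i.e. 256 cores = 2 full Aion nodes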
Pre-requisites
Ensure you are able to connect to the UL HPC clusters.
In particular, recall that the `module` command is not available on the access frontends. For all tests and compilations, you MUST work on a compute node.
Now you'll need to pull the latest changes into your working copy of the ULHPC/tutorials repository, which you should have cloned in `~/git/github.com/ULHPC/tutorials` (see the "preliminaries" tutorial):
(access)$> cd ~/git/github.com/ULHPC/tutorials
(access)$> git pull
Now configure a dedicated directory `~/tutorials/OpenMP-MPI` for this session:
# return to your home
(access)$> mkdir -p ~/tutorials/OpenMP-MPI/bin
(access)$> cd ~/tutorials/OpenMP-MPI
# create a symbolic link to the root reference parallel tutorial material
(access)$> ln -s ~/git/github.com/ULHPC/tutorials/parallel ref.d
(access)$> ln -s ref.d/basics .    # Basics instructions
(access)$> cd basics
Advanced users (optional yet strongly recommended): create a Tmux session (see the Tmux cheat sheet and tutorial) or a GNU Screen session that you can recover later. See also the "Getting Started" tutorial.
# /!\ Advanced (but recommended) best-practice:
# Always work within a TMux or GNU Screen session named '<topic>' (Adapt accordingly)
(access-aion)$> tmux new -s HPC-school # Tmux
(access-iris)$> screen -S HPC-school # GNU Screen
# TMux | GNU Screen | Action
# ----------|------------|----------------------------------------------
# CTRL+b c  | CTRL+a c   | (create) create a new window. The default window number is zero.
# CTRL+b n  | CTRL+a n   | (next) switch to the next window.
# CTRL+b p  | CTRL+a p   | (prev) switch to the previous window.
# CTRL+b ,  | CTRL+a A   | (title) rename the current window
# CTRL+b d  | CTRL+a d   | (detach) detach from the current session
# Once detached:
#   tmux ls  | screen -ls : list the available sessions
#   tmux att | screen -x  : reattach to a past session
Parallel OpenMP Jobs
OpenMP (Open Multi-Processing) is a popular parallel programming model for multi-threaded applications. More precisely, it is an Application Programming Interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran on most platforms, instruction set architectures and operating systems.
- Reference website: https://www.openmp.org/
- Latest version: 5.2 (Nov 2021) -- specifications
- Below notes are adapted from LLNL OpenMP tutorial
OpenMP is designed for multi-processor/multi-core, shared-memory machines (nowadays NUMA). OpenMP programs accomplish parallelism exclusively through the use of threads.
- A thread of execution is the smallest unit of processing that can be scheduled by an operating system.
- Threads exist within the resources of a single process. Without the process, they cease to exist.
- Typically, the number of threads matches the number of machine processors/cores -- see Resource Allocation.
    - Reminder: aion compute nodes MUST be seen as 8 (virtual) processors of 16 cores each, even if physically the nodes host 2 sockets of AMD Epyc ROME 7H12 processors with 64 cores each (total: 128 cores per node). iris compute nodes typically host 2 physical processors of 14 cores each (total: 28 cores per node); the exceptions are the bigmem nodes (4 physical processors of 28 cores each, total: 112 cores per node).
- However, the actual use of threads is up to the application.
- `OMP_NUM_THREADS` (if present) initially specifies the maximum number of threads;
    - you can use `omp_set_num_threads()` to override the value of `OMP_NUM_THREADS`;
    - the presence of a `num_threads` clause overrides both other values.
- OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.
    - parallelization can be as simple as taking a serial program and inserting compiler directives...
    - in general, it is much more complex than that.
- OpenMP uses the fork-join model of parallel execution:
    - FORK: the master thread creates a team of parallel threads.
    - The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads.
    - JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.
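To illustrate the fork-join model and the thread-count precedence rules above, here is a minimal sketch (not part of the provided sources); compile it with `gcc -fopenmp` (foss) or `icc -qopenmp` (intel):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);        /* overrides OMP_NUM_THREADS (if set) */

    /* FORK: the master thread creates a team of threads; the num_threads
     * clause overrides both OMP_NUM_THREADS and omp_set_num_threads(). */
    #pragma omp parallel num_threads(2)
    {
        printf("Hello from thread %d/%d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   /* JOIN: the team synchronizes and terminates; only the master remains */
    return 0;
}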
Slurm reservations for OpenMP programs
- (optional, as this is the default, but it is recommended to always specify it so you know what you are doing) set a single task per node with `--ntasks-per-node=1`
- Use `-c <N>` (or `--cpus-per-task <N>`) to set the number of OpenMP threads you wish to use.
- (again) The number of threads should not exceed the number of cores on a compute node.
Thus a minimal Slurm launcher for OpenMP would typically look like the following -- see also our default Slurm launchers.
#!/bin/bash -l
#SBATCH --ntasks-per-node=1 # Run a single task per node, more explicit than '-n 1'
#SBATCH -c 28 # on iris: number of CPU cores i.e. OpenMP threads per task
###SBATCH -c 128 # on aion (remove first '##' and top line)
#SBATCH --time=0-01:00:00
#SBATCH -p batch
print_error_and_exit() { echo "***ERROR*** $*"; exit 1; }
module purge || print_error_and_exit "No 'module' command"
module load toolchain/foss # or toolchain/intel
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
OPTS=$*
srun /path/to/your/threaded.app ${OPTS}
OpenMP Compilation
| Toolchain         | Compilation command (C) | Compilation command (C++) |
|-------------------|-------------------------|---------------------------|
| `toolchain/intel` | `icc -qopenmp [...]`    | `icpc -qopenmp [...]`     |
| `toolchain/foss`  | `gcc -fopenmp [...]`    | `g++ -fopenmp [...]`      |
Hands-on: OpenMP Helloworld and matrix multiplication
You can find in src/hello_openmp.c
the traditional OpenMP "Helloworld" example.
- Reserve an interactive job to launch 4 OpenMP threads (for 30 minutes)
(access)$> si -c 4 -t 0:30:00
$> export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
- Check the set variable `$OMP_NUM_THREADS`. Which value do you expect?

  $> echo $OMP_NUM_THREADS

- Check and compile the source `src/hello_openmp.c` to generate:
    - `bin/${ULHPC_CLUSTER}_hello_openmp` (compiled with the `foss` toolchain)
    - `bin/intel_${ULHPC_CLUSTER}_hello_openmp` (compiled with the `intel` toolchain)
    - indeed, it's always good practice to indicate the supercomputer used to generate the binary: use the environment variable `${ULHPC_CLUSTER}` for that.
$> cat src/hello_openmp.c
######### foss toolchain
$> module purge # Safeguard
$> module load toolchain/foss
$> gcc -fopenmp -Wall -O2 src/hello_openmp.c -o bin/${ULHPC_CLUSTER}_hello_openmp
######### intel toolchain
$> module purge # Safeguard
$> module load toolchain/intel
$> icc -qopenmp -xhost -Wall -O2 src/hello_openmp.c -o bin/intel_${ULHPC_CLUSTER}_hello_openmp
- (only if you have trouble compiling): `make omp`
- Execute the generated binaries multiple times. What do you notice?
- Exit your interactive session (`exit` or `CTRL-D`)
- Prepare a launcher script (use your favorite editor) to execute this application in batch mode -- adapt the pthreads/OpenMP template launcher:

  $> sbatch ./launcher.OpenMP.sh
Repeat the above procedure on a more serious computation: a naive matrix multiplication using OpenMP, whose source code is located in `src/matrix_mult_openmp.c`.
Adapt the launcher script to sustain both executions (OpenMP helloworld and matrix multiplication).
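For reference, a naive OpenMP matrix multiplication typically parallelizes the outer loop so that the rows of the result are shared among the threads. The sketch below is only an illustration (with a hypothetical fixed dimension N); it is not necessarily identical to the provided src/matrix_mult_openmp.c:

#include <stdio.h>
#include <omp.h>

#define N 1024                        /* hypothetical matrix dimension */
static double A[N][N], B[N][N], C[N][N];

int main(void) {
    /* initialize A and B with arbitrary values */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    double t0 = omp_get_wtime();
    /* the iterations over the rows of C are divided among the OpenMP threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    printf("C[0][0] = %g, elapsed = %gs\n", C[0][0], omp_get_wtime() - t0);
    return 0;
}

Comparing the elapsed time for different numbers of threads is precisely the point of the launcher exercise above.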
Note: if you are lazy (or late), you can use the provided launcher script scripts/launcher.OpenMP.sh
.
$> ./scripts/launcher.OpenMP.sh -h
NAME
launcher.OpenMP.sh: Generic OpenMP launcher
Default APPDIR: /home/users/svarrette/tutorials/OpenMP-MPI/basics/bin
Default APP: aion_hello_openmp
Take the good habit to prefix the intel binaries (as foss toolchain is assumed by default)
with 'intel_'
USAGE
[sbatch] ./scripts/launcher.OpenMP.sh [-n] {intel | foss } [app]
EXE=/path/to/multithreadedapp.exe [sbatch] ./scripts/launcher.OpenMP.sh [-n] {intel | foss }
OPTIONS:
-n --dry-run: Dry run mode
Example:
[sbatch] ./scripts/launcher.OpenMP.sh # run FOSS build <cluster>_hello_openmp
[sbatch] ./scripts/launcher.OpenMP.sh intel # run intel build intel_<cluster>_hello_openmp
[sbatch] ./scripts/launcher.OpenMP.sh foss matrix_mult_openmp # run FOSS build matrix_mult_openmp
EXE=/home/users/svarrette/bin/datarace [sbatch] ./scripts/launcher.OpenMP.sh intel # run intel build ~/bin/datarace
Now you can execute it:
$ ./scripts/launcher.OpenMP.sh
$ ./scripts/launcher.OpenMP.sh intel
$ ./scripts/launcher.OpenMP.sh foss ${ULHPC_CLUSTER}_matrix_mult_openmp
Passive jobs examples on aion:
$> sbatch -c 128 ./scripts/launcher.OpenMP.sh foss ${ULHPC_CLUSTER}_matrix_mult_openmp
$> sbatch -c 128 ./scripts/launcher.OpenMP.sh intel ${ULHPC_CLUSTER}_matrix_mult_openmp
Check the elapsed time: what do you notice?
(optional) Hands-on: OpenMP data race benchmark suite
One way to test most OpenMP features is to evaluate their execution against a benchmark. For instance, we are going to test the OpenMP installation against DataRaceBench, a benchmark suite designed to systematically and quantitatively evaluate the effectiveness of data race detection tools. It includes a set of microbenchmarks with and without data races. Parallelism is represented by OpenMP directives.
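As a reminder of what such tools detect, the sketch below (illustrative only, not taken from DataRaceBench) contains a classic data race on a shared accumulator, followed by the race-free variant using a reduction clause:

#include <stdio.h>

int main(void) {
    double sum_racy = 0.0, sum_ok = 0.0;

    /* DATA RACE: all threads update 'sum_racy' concurrently without synchronization */
    #pragma omp parallel for
    for (int i = 1; i <= 1000; i++)
        sum_racy += i;

    /* Race-free: each thread accumulates privately; partial sums are combined at the end */
    #pragma omp parallel for reduction(+:sum_ok)
    for (int i = 1; i <= 1000; i++)
        sum_ok += i;

    printf("racy = %g (non-deterministic), correct = %g (expected 500500)\n",
           sum_racy, sum_ok);
    return 0;
}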
$> cd ~/git/github.com/ULHPC/tutorials/parallel/basics
$> make fetch # clone src/dataracebench
$> cd src/dataracebench
Now you can reserve the nodes and set `OMP_NUM_THREADS`:
- Reserve an interactive job to launch a maximum of OpenMP threads on 1 node (for 1 hour)
# Example on Aion, 128 cores per node
(access)$> si --ntasks-per-node=1 -c 128 -t 1:00:00
$> export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
- Open another terminal (or another `tmux`/`screen` window) to monitor the execution (see the instructions at the top).
- Execute the benchmark, for instance using the `intel` toolchain:
$> module load toolchain/intel
$> ./check-data-races.sh --help
Usage: ./check-data-races.sh [--run] [--help] language
--help : this option
--small : compile and test all benchmarks using small parameters with Helgrind, ThreadSanitizer, Archer, Intel inspector.
--run : compile and run all benchmarks with gcc (no evaluation)
--run-intel : compile and run all benchmarks with Intel compilers (no evaluation)
--helgrind : compile and test all benchmarks with Helgrind
--tsan-clang: compile and test all benchmarks with clang ThreadSanitizer
--tsan-gcc : compile and test all benchmarks with gcc ThreadSanitizer
--archer : compile and test all benchmarks with Archer
--coderrect : compile and test all benchmarks with Coderrect Scanner
--inspector : compile and test all benchmarks with Intel Inspector
--romp : compile and test all benchmarks with Romp
--llov : compile and test all benchmarks with LLVM OpenMP Verifier (LLOVE)
--customize : compile and test customized test list and tools
$> ./check-data-races.sh --run-intel C
Useful OpenMP links:
- https://www.openmp.org/
- OpenMP Tutorial LLNL
Parallel/Distributed MPI Jobs
The Message Passing Interface (MPI) Standard is a message passing library standard based on the consensus of the MPI Forum. The goal of the Message Passing Interface is to establish a portable, efficient, and flexible standard for message passing that will be widely used for writing message passing programs. MPI is not an IEEE or ISO standard, but has in fact, become the "industry standard" for writing message passing programs on HPC platforms.
- Reference website: https://www.mpi-forum.org/
- Latest version: 4.0 (June 2021) -- specifications
- Below notes are adapted from LLNL MPI tutorial
In the MPI programming model, a computation comprises one or more processes that communicate by calling library routines to send and receive messages to other processes. In most MPI implementations, a fixed set of processes is created at program initialization, and one process is created per processor.
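For illustration, a minimal MPI program follows the pattern below (a sketch in the spirit of the src/hello_mpi.c used later in this tutorial, not necessarily identical to it):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, len;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* the fixed set of processes is created here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* who am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many of us? */
    MPI_Get_processor_name(hostname, &len);

    printf("Hello from rank %d/%d running on %s\n", rank, size, hostname);
    MPI_Finalize();
    return 0;
}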
MPI implementations
The UL HPC platform offers several MPI implementations:

| MPI Suite | `module load ...`  | Compiler (C)   | Compiler (C++)  |
|-----------|--------------------|----------------|-----------------|
| Intel MPI | `toolchain/intel`  | `mpiicc [...]` | `mpiicpc [...]` |
| OpenMPI   | `mpi/OpenMPI`      | `mpicc [...]`  | `mpic++ [...]`  |
MPI compilation
| MPI Suite | Example compilation command                            |
|-----------|--------------------------------------------------------|
| Intel MPI | `{mpiicc/mpiicpc} -Wall [-qopenmp] [-xhost] -O2 [...]` |
| OpenMPI   | `{mpicc/mpic++} -Wall [-fopenmp] -O2 [...]`            |
Of course, it is possible to have hybrid code, mixing MPI and OpenMP primitives.
Slurm reservations and usage for MPI programs
- set the number of distributed nodes you want to reserve with `-N <N>`
- set the number of MPI processes per node (that's more explicit) with `--ntasks-per-node=<N>`
    - you can also use `-n <N>` to specify the total number of MPI processes you want, but the above approach is advised.
- (optional, as this is the default) set a single thread per MPI process with `-c 1`
    - except when running a hybrid code...

Important:
- To run your MPI program, be aware that Slurm is able to directly launch MPI tasks and initialize MPI communications via the Process Management Interface (PMI) and PMIx
    - this permits the task affinity to be resolved by the scheduler (avoiding the use of `mpirun --map-by [...]`)
- Simply use (whatever MPI flavor you use):

  srun -n $SLURM_NTASKS /path/to/mpiprog [...]
Thus a minimal launcher for OpenMPI would typically look like the following -- see the MPI template Launcher
#!/bin/bash -l
#SBATCH -N 2
#SBATCH --ntasks-per-node 128 # MPI processes per node - use 28 on iris
#SBATCH -c 1
#SBATCH --time=0-01:00:00
#SBATCH -p batch
print_error_and_exit() { echo "***ERROR*** $*"; exit 1; }
module purge || print_error_and_exit "No 'module' command"
module load toolchain/foss
module load mpi/OpenMPI
OPTS=$*
srun -n $SLURM_NTASKS /path/to/your/openmpi.app ${OPTS}
In the above example, 2x128 = 256 MPI processes will be launched (matches Aion configuration). You will have to adapt it for running on Iris.
Hands-on: MPI Helloworld and matrix multiplication
You can find in src/hello_mpi.c
the traditional MPI "Helloworld" example.
- Reserve an interactive job to launch 6 MPI processes across two nodes, i.e. 2x3 (for 30 minutes)

  (access)$> si -N 2 --ntasks-per-node=3 -t 0:30:00

- Check and compile the source `src/hello_mpi.c` to generate:
    - `bin/openmpi_${ULHPC_CLUSTER}_hello_mpi` (compiled with the `mpi/OpenMPI` module)
    - `bin/intel_${ULHPC_CLUSTER}_hello_mpi` (compiled with the `intel` toolchain and Intel MPI)
$> cat src/hello_mpi.c
######### OpenMPI
$> module purge # Safeguard
$> module load mpi/OpenMPI
$> mpicc -Wall -O2 src/hello_mpi.c -o bin/openmpi_${ULHPC_CLUSTER}_hello_mpi
######### Intel MPI
$> module purge # Safeguard
$> module load toolchain/intel
$> mpiicc -Wall -xhost -O2 src/hello_mpi.c -o bin/intel_${ULHPC_CLUSTER}_hello_mpi
- (only if you have trouble compiling): `make mpi`
- Execute the generated binaries multiple times. What do you notice?
- Exit your interactive session (`exit` or `CTRL-D`)
- Prepare a launcher script (use your favorite editor) to execute this application in batch mode -- adapt the MPI template launcher:

  $> sbatch ./launcher.MPI.sh
Repeat the above procedure on a more serious computation: a naive matrix multiplication using MPI, whose source code is located in `src/matrix_mult_mpi.c`.
Adapt the launcher script to sustain both executions (MPI helloworld and matrix multiplication).
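For reference, a naive MPI matrix multiplication typically broadcasts one matrix and distributes blocks of rows of the other across the ranks. The sketch below is only an illustration (hypothetical fixed dimension N, assumed divisible by the number of ranks); it is not necessarily identical to the provided src/matrix_mult_mpi.c:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 512                 /* hypothetical dimension; assumed divisible by the number of ranks */

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                         /* block of rows owned by each rank */
    double *A = NULL, *C = NULL;
    double *B     = malloc(N * N    * sizeof(double));
    double *A_loc = malloc(rows * N * sizeof(double));
    double *C_loc = malloc(rows * N * sizeof(double));

    if (rank == 0) {                             /* rank 0 initializes the full matrices */
        A = malloc(N * N * sizeof(double));
        C = malloc(N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    /* distribute the data: B to everyone, a block of rows of A to each rank */
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(A, rows * N, MPI_DOUBLE, A_loc, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* local computation of the owned rows of C */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A_loc[i * N + k] * B[k * N + j];
            C_loc[i * N + j] = sum;
        }

    /* gather the result back on rank 0 */
    MPI_Gather(C_loc, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("C[0] = %g (expected %g)\n", C[0], 2.0 * N);

    free(B); free(A_loc); free(C_loc);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}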
Note: if you are lazy (or late), you can use the provided launcher script scripts/launcher.MPI.sh
.
$ ./scripts/launcher.MPI.sh -h
NAME
launcher.MPI.sh: Generic MPI launcher
Default APPDIR: /Users/svarrette/tutorials/OpenMP-MPI/basics/bin
Default APP: _hello_mpi
Take the good habit to prefix the binary to execute with MPI suit used for
the build. Here the default MPI application run would be
EXE=/Users/svarrette/tutorials/OpenMP-MPI/basics/bin/openmpi__hello_mpi
which will be run as srun -n $SLURM_NTASKS [...]
USAGE
[sbatch] ./scripts/launcher.MPI.sh [-n] {intel | openmpi | mvapich2} [app]
EXE=/path/to/mpiapp.exe [sbatch] ./scripts/launcher.MPI.sh [-n] {intel | openmpi | mvapich2}
OPTIONS:
-n --dry-run: Dry run mode
Example:
[sbatch] ./scripts/launcher.MPI.sh # run OpenMPI build openmpi_<cluster>_hello_mpi
[sbatch] ./scripts/launcher.MPI.sh intel # run Intel MPI build intel_<cluster>_hello_mpi
[sbatch] ./scripts/launcher.MPI.sh openmpi matrix_mult_mpi # run OpenMPI build openmpi_matrix_mult_mpi
EXE=/Users/svarrette/bin/xhpl [sbatch] ./scripts/launcher.MPI.sh intel # run intel build ~/bin/xhpl
Now you can execute it:
$ ./scripts/launcher.MPI.sh
$ ./scripts/launcher.MPI.sh intel
$ ./scripts/launcher.MPI.sh ${ULHPC_CLUSTER}_matrix_mult_mpi
$ ./scripts/launcher.MPI.sh intel ${ULHPC_CLUSTER}_matrix_mult_mpi
Passive jobs examples:
$ sbatch --ntasks-per-node 128 ./scripts/launcher.MPI.sh openmpi
$ sbatch --ntasks-per-node 128 ./scripts/launcher.MPI.sh intel ${ULHPC_CLUSTER}_matrix_mult_mpi
Check the elapsed time: what do you notice?
Useful MPI links:
- http://www.mpi-forum.org/docs/
- MPI Tutorial LLNL
- Intel MPI:
    - Step by Step Performance Optimization with Intel(R) C++ Compiler
    - Intel(R) C++ Compiler Developer Guide and Reference
Hybrid OpenMP+MPI Programs
Of course, you can have hybrid code mixing MPI and OpenMP primitives.
- You need to compile the code with the `-qopenmp` (Intel MPI) or `-fopenmp` (other MPI suites) flag
- You need to adapt the `OMP_NUM_THREADS` environment variable accordingly
    - you need to adapt the value of `-c <N>` (or `--cpus-per-task <N>`) to set the number of OpenMP threads you wish to use per MPI process
    - try to inherit it from the Slurm allocation (and provide a meaningful default value):

      export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

- For best performance, you MUST align resource specs with physical NUMA characteristics -- see the ULHPC Technical documentation on Slurm Resource Allocation
    - aion compute nodes MUST be seen as 8 (virtual) processors of 16 cores each, even if physically the nodes host 2 sockets of AMD Epyc ROME 7H12 processors with 64 cores each (total: 128 cores per node).
    - iris compute nodes typically host 2 physical processors of 14 cores each (total: 28 cores per node); the exceptions are the bigmem nodes (4 physical processors of 28 cores each, total: 112 cores per node).
Other misc considerations:
- You need to ensure the environment variable `OMP_NUM_THREADS` is shared across the nodes
- (Intel MPI only) you probably want to set `I_MPI_PIN_DOMAIN=omp`
- Like any MPI execution, simply use (whatever MPI flavor you use):

  srun -n $SLURM_NTASKS /path/to/hybrid [...]
Thus a minimal launcher for hybrid OpenMP/MPI would typically look like the following -- see the Hybrid OpenMP+MPI template Launcher
#!/bin/bash -l
# Multi-node hybrid OpenMP+MPI application launcher
#SBATCH -N 2
#SBATCH --ntasks-per-node 8 # MPI processes per node - use 2 on iris
#SBATCH --ntasks-per-socket 1 # MPI processes per [virtual] processor
#SBATCH -c 16 # OpenMP threads per MPI process - use 14 on iris
#SBATCH --time=0-01:00:00
#SBATCH -p batch
print_error_and_exit() { echo "***ERROR*** $*"; exit 1; }
module purge || print_error_and_exit "No 'module' command"
module load mpi/OpenMPI # or toolchain/intel
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
OPTS=$*
srun -n $SLURM_NTASKS /path/to/your/parallel-hybrid-app ${OPTS}
In the above example, 2x8 = 16 MPI processes will be launched, each with 16 OpenMP threads (to match Aion configuration). You will have to adapt it for running on Iris.
Hands-on: Hybrid OpenMP+MPI Helloworld
You can find in src/hello_hybrid.c
the traditional OpenMP+MPI "Helloworld" example.
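The structure of such a hybrid program is typically as follows (a minimal sketch, not necessarily identical to the provided src/hello_hybrid.c): each MPI rank initializes MPI with the desired thread support level, then opens an OpenMP parallel region.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int provided, rank, size;

    /* request MPI_THREAD_FUNNELED: only the master thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each MPI process forks OMP_NUM_THREADS (= SLURM_CPUS_PER_TASK) threads */
    #pragma omp parallel
    printf("Hello from thread %d/%d of MPI rank %d/%d\n",
           omp_get_thread_num(), omp_get_num_threads(), rank, size);

    MPI_Finalize();
    return 0;
}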
- Reserve an interactive job to launch 2 MPI processes (1 per node), each composed of 4 OpenMP threads (for 30 minutes)

  (access)$ si -N 2 --ntasks-per-node=1 -c 4 -t 0:30:00
  $ export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

- Check the set variable `$OMP_NUM_THREADS`. Which value do you expect?

  $> echo $OMP_NUM_THREADS

- Check and compile the source `src/hello_hybrid.c` to generate:
    - `bin/openmpi_${ULHPC_CLUSTER}_hello_hybrid` (compiled with the `mpi/OpenMPI` module)
    - `bin/intel_${ULHPC_CLUSTER}_hello_hybrid` (compiled with the `intel` toolchain and Intel MPI)
$> cat src/hello_hybrid.c
######### OpenMPI
$> module purge # Safeguard
$> module load mpi/OpenMPI
$> mpicc -fopenmp -Wall -O2 src/hello_hybrid.c -o bin/openmpi_${ULHPC_CLUSTER}_hello_hybrid
######### Intel MPI
$> module purge # Safeguard
$> module load toolchain/intel
$> mpiicc -qopenmp -Wall -xhost -O2 src/hello_hybrid.c -o bin/intel_${ULHPC_CLUSTER}_hello_hybrid
- (only if you have trouble compiling): `make hybrid`
- Execute the generated binaries (see the above tips)
- Exit your interactive session (`exit` or `CTRL-D`)
- Adapt the MPI launcher to allow batch job submissions of hybrid programs:

  $> sbatch ./launcher.hybrid.sh

Note: if you are lazy (or late), you can use the provided launcher script `scripts/launcher.hybrid.sh`.
$ ./scripts/launcher.hybrid.sh -h
NAME
launcher.hybrid.sh: Generic Hybrid OpenMP+MPI launcher
Default APPDIR: /Users/svarrette/tutorials/OpenMP-MPI/basics/bin
Default APP: _hello_hybrid
Take the good habit to prefix the binary to execute with MPI suit used for
the build. Here the default Hybrid OpenMP+MPI application run would be
EXE=/Users/svarrette/tutorials/OpenMP-MPI/basics/bin/openmpi__hello_hybrid
which will be run as srun -n $SLURM_NTASKS [...]
USAGE
[sbatch] ./scripts/launcher.hybrid.sh [-n] {intel | openmpi | mvapich2} [app]
EXE=/path/to/hydridapp.exe [sbatch] ./scripts/launcher.hybrid.sh [-n] {intel | openmpi | mvapich2}
OPTIONS:
-n --dry-run: Dry run mode
Example:
[sbatch] ./scripts/launcher.hybrid.sh # run hybrid OpenMPI build openmpi_<cluster>_hello_hybrid
[sbatch] ./scripts/launcher.hybrid.sh intel # run hybrid Intel MPI build intel_<cluster>_hello_hybrid
[sbatch] ./scripts/launcher.hybrid.sh openmpi matrix_mult # run hybrid OpenMPI build openmpi_matrix_mult
EXE=/Users/svarrette/bin/hpcg [sbatch] ./scripts/launcher.hybrid.sh intel # run hybrid intel build ~/bin/hpcg
Now you can execute it:
$ ./scripts/launcher.hybrid.sh
$ ./scripts/launcher.hybrid.sh intel
Passive jobs examples:
# On Aion, you need to adapt the default settings
$> sbatch --ntasks-per-node 8 -c 16 ./scripts/launcher.hybrid.sh
$> sbatch --ntasks-per-node 8 -c 16 ./scripts/launcher.hybrid.sh intel
Code optimization tips for your OpenMP and/or MPI programs
- Consider changing your memory allocation functions to avoid fragmentation and enable scalable concurrency support (this applies to both OpenMP and MPI programs) -- see the run-time sketch below.
- When using the `intel` toolchain:
    - see the Step by Step Performance Optimization with Intel(R) C++ Compiler
    - the `-xhost` option enables processor-specific optimizations.
    - you might wish to consider the Interprocedural Optimization (IPO) approach, an automatic, multi-step process that allows the compiler to analyze your code and determine where you can benefit from specific optimizations.
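As a purely illustrative sketch of the memory-allocation tip above (the library path below is a placeholder -- adapt it to a scalable allocator such as jemalloc or Intel TBB's tbbmalloc proxy that is actually installed on the cluster), the default `malloc` can often be replaced at run time without touching the code:

# Hypothetical run-time substitution of the default allocator (adapt the path)
export LD_PRELOAD=/path/to/libjemalloc.so      # or /path/to/libtbbmalloc_proxy.so
srun -n ${SLURM_NTASKS} /path/to/your/parallel-app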
Troubleshooting
srun: error: PMK_KVS_Barrier duplicate request from task ...

- If you are trying to use `mpirun` (instead of `srun`) from Intel MPI within a Slurm session and receive such an error on `mpirun`: make sure `$I_MPI_PMI_LIBRARY` is not set (`unset I_MPI_PMI_LIBRARY`).