
Introduction to OpenACC Programming Model (C/C++ and Fortran)

 Copyright (c) E. Krishnasamy, 2013-2023 UL HPC Team <hpc-sysadmins@uni.lu>

Objectives

Prerequisite:

  • C/C++ and/or FORTRAN languages
  • OpenMP or some basic parallel programming concepts (an advantage, but not necessary)

Important: PRACE MOOC

NOTE: This lecture is limited to just 45 min; it only covers a basic-to-intermediate introduction to OpenACC. To learn more about CUDA programming and the OpenACC programming model (from basic to advanced), please refer to the PRACE MOOC GPU Programming for Scientific Computing and Beyond - Dr. Ezhilmathi Krishnasamy and Prof. Pascal Bouvry


Pre-requisites

Ensure you are able to connect to the UL HPC clusters. In particular, recall that the module command is not available on the access frontends. For all tests and compilation, you MUST work on a computing node.

Now you'll need to pull the latest changes in your working copy of the ULHPC/tutorials repository, which you should have cloned in ~/git/github.com/ULHPC/tutorials (see the "preliminaries" tutorial):

(access)$ cd ~/git/github.com/ULHPC/tutorials
(access)$ git pull

Now configure a dedicated directory ~/tutorials/openacc for this session

# return to your home
(access)$ mkdir -p ~/tutorials/openacc
(access)$ cd ~/tutorials/openacc
# create a symbolic link to the top reference material
(access)$ ln -s ~/git/github.com/ULHPC/tutorials/gpu/openacc/basics ref.d # Symlink to the reference tutorial material
# copy / synchronize a copy of the exercises
(access)$ rsync -avzu ref.d/exercises .   # DO NOT forget the trailing .

Advanced users (optional, yet strongly recommended): create a Tmux session (see the Tmux cheat sheet and tutorial) or a GNU Screen session that you can recover later. See also the "Getting Started" tutorial.

Connect to Iris cluster and get an interactive GPU job

See also GPU jobs documentation.

$ ssh iris-cluster   # The only cluster featuring GPUs
$ si-gpu -G 1 --ntasks-per-node 1 -c 7 -t 00:30:00   # (optionally) add --reservation=hpcschool-gpu
$ si-gpu -G 1 --reservation=hpcschool-gpu -N 1 -c 14 -t 01:00:00 

Load the OpenACC compiler (Nvidia HPC SDK)

$> module spider nvhpc
$> module load compiler/NVHPC/21.2
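
Optionally, check that the compiler and the GPU allocated to your job are visible (standard commands, listed here only as a quick sanity check):

$> nvc --version     # print the Nvidia C compiler version
$> nvidia-smi        # list the GPU(s) visible from the job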

Difference between CPU and GPU

CPU vs GPU

  • A CPU runs at a higher clock frequency than a GPU.
  • But a GPU can run many more threads in parallel than a CPU.
  • On a GPU, the cores are grouped into "Streaming Multiprocessors" (SMs).
  • Nvidia GPUs also feature "Tensor Cores" to handle AI/ML computations in an optimized way.
  • GPUs follow the "Single Instruction, Multiple Threads" (SIMT) execution model.
  • Threads are executed in groups on the GPU, typically of 32 threads; such a group is called a "warp" on Nvidia GPUs and a "wavefront" on AMD GPUs.


Thread Hierarchy

How GPUs are used for computations

  • Step 1: prepare the application and allocate memory on both the CPU and the GPU
  • Step 2: transfer the data to the GPU
  • Step 3: do the computation on the GPU
  • Step 4: transfer the data back to the CPU
  • Step 5: finalize the application and free the memory on both the CPU and the GPU (a minimal sketch of these steps follows below)
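
As a minimal sketch of these five steps (a hypothetical example, not part of the exercise files), the OpenACC data and kernels directives map onto them as follows: the data region performs the transfers (steps 2 and 4) and the loop inside it runs on the GPU (step 3), while allocation and deallocation on the CPU remain ordinary C code (steps 1 and 5).

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int n = 1 << 20;
  float *x = malloc(n * sizeof(float));   /* Step 1: allocate and initialize on the CPU */
  float *y = malloc(n * sizeof(float));
  for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  /* Steps 2-4: copy x and y to the GPU, run the loop there,
     and copy y back to the CPU when the region ends */
  #pragma acc data copyin(x[0:n]) copy(y[0:n])
  {
    #pragma acc kernels loop
    for (int i = 0; i < n; i++)
      y[i] = 2.0f * x[i] + y[i];
  }

  printf("y[0] = %f\n", y[0]);
  free(x); free(y);                        /* Step 5: free the CPU memory */
  return 0;
}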

OpenACC

Few points about OpenACC

  • OpenACC is not GPU programming
  • OpenACC is a way of expressing the parallelism in your existing code
  • OpenACC can be used on both Nvidia and AMD GPUs
  • “OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.”
    • Buddy Bland, Titan Project Director, Oak Ridge National Lab
  • “OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.”
    • Michael Wong, CEO of the OpenMP ARB (Architecture Review Board)

Ways to accelerate applications on the GPU


  • Libraries: easy to use with very limited knowledge of GPU programming
    • cuBLAS, cuFFT, CUDA Math Library, etc.
  • Directive-based programming models: accelerate the application by adding directives to the existing code
    • OpenACC and OpenMP (offloading)
  • Programming languages: low-level programming languages that allow the application to be further optimized on the accelerator
    • CUDA, OpenCL, etc.

Compilers and directives (only a few of them are listed here)

  • OpenACC is supported by the Nvidia, PGI, GCC, and HPE Cray (FORTRAN only) compilers
    • PGI is now part of Nvidia, and its compilers are available through the Nvidia HPC SDK
  • Compute constructs:
    • parallel and kernels
  • Loop construct and its clauses:
    • loop, collapse, gang, worker, vector, etc.
  • Data management clauses:
    • copy, create, copyin, copyout, delete and present
  • Others:
    • reduction, atomic, cache, etc.
  • More information about the OpenACC directives can be found here

Basic programming structure

// C/C++
#include "openacc.h"
#pragma acc <directive> [clauses [[,] clause] . . .] new-line
<code>
!! Fortran
use openacc
!$acc <directive> [clauses [[,] clause] . . .]
<code>

Compute and loop constructs in OpenACC

kernels in C/C++

// Hello_World.c                    | // Hello_World_OpenACC.c
void Print_Hello_World()            | void Print_Hello_World()
{                                   | {
                                    | #pragma acc kernels
  for(int i = 0; i < 5; i++)        |    for(int i = 0; i < 5; i++)
    {                               |      {
      printf("Hello World!\n");     |        printf("Hello World!\n");
    }                               |      }
}                                   | }
  • compilation: nvc -fast -Minfo=all -acc=gpu -gpu=cc70 Hello_World.c
    • The compiler already gives a lot of information; what do you see?
$> nvc -fast -Minfo=all -acc=gpu -gpu=cc70 Hello_World.c
Print_Hello_World:
      6, Loop not parallelized: contains call
main:
     14, Print_Hello_World inlined, size=4 (inline) file Hello_World.c (5)
           6, Loop not vectorized/parallelized: contains call
  • Now add either the kernels or the parallel directive to vectorize/parallelize the loop:
$> nvc -fast -Minfo=all -acc=gpu -gpu=cc70 Hello_World_OpenACC.c

print_hello_world:
      6, Loop is parallelizable
         Generating Tesla code
          6, !$acc loop gang, vector(32) ! blockidx%x threadidx%x

  • As we can see above, the loop is now parallelized. A complete, compilable version is sketched below for reference.
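
For reference, a complete and compilable version of the OpenACC variant could look as follows (the main function and the include are assumptions added here, since only the function body is shown above):

#include <stdio.h>

void Print_Hello_World()
{
  #pragma acc kernels
  for(int i = 0; i < 5; i++)
    {
      printf("Hello World!\n");
    }
}

int main(void)
{
  Print_Hello_World();
  return 0;
}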

kernels in FORTRAN

!! Hello_World.f90                 | !! Hello_World_OpenACC.f90
subroutine Print_Hello_World()     | subroutine Print_Hello_World()
  integer :: i                     |   integer :: i
                                   |   !$acc kernels
  do i = 1, 5                      |   do i = 1, 5
     print *, "hello world"        |     print *, "hello world"
  end do                           |   end do
                                   |  !$acc end kernels
end subroutine Print_Hello_World   | end subroutine Print_Hello_World
  • Compile Hello_World.f90; the compiler tells us that the loop is not parallelized.
$> nvfortran -fast -Minfo=all -acc=gpu -gpu=cc70 Hello_World.f90

print_hello_world:
      5, Loop not vectorized/parallelized: contains call
  • Now compile Hello_World_OpenACC.f90, using either kernels or parallel, and notice that the loop is now vectorized/parallelized.
$> nvfortran -fast -Minfo=all -acc=gpu -gpu=cc70 Hello_World_OpenACC.f90

print_hello_world:
      6, Loop is parallelizable
         Generating Tesla code
          6, !$acc loop gang, vector(32) ! blockidx%x threadidx%x

Note: the example above shows how to create a parallel region using parallel or kernels. This is quite useful when multiple regions need to be parallelized; however, the example above has just one parallel region. A parallel variant of the same loop is sketched below.
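
For comparison, a sketch of the same C function using the parallel construct instead of kernels (the behaviour is expected to be the same, but the parallelization decision moves from the compiler to the programmer):

void Print_Hello_World()
{
  /* With parallel the programmer asserts the loop is safe to parallelize;
     with kernels the compiler decides which loops to offload. */
  #pragma acc parallel loop
  for(int i = 0; i < 5; i++)
    {
      printf("Hello World!\n");
    }
}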

loop and data management clauses

for C/C++

  • Here we consider a simple vector addition example for the OpenACC loop directive.

// Vector_Addition.c                  | // Vector_Addition_OpenACC.c
float * Vector_Addition               | float * Vector_Addition
(float *restrict a, float *restrict b,| (float *restrict a, float *restrict b, 
float *restrict c, int n)             | float *restrict c, int n)
{                                     | {
                                      | #pragma acc kernels loop \
                                      | copyin(a[0:n], b[0:n]) copyout(c[0:n])
  for(int i = 0; i < n; i ++)         |   for(int i = 0; i < n; i ++)
    {                                 |     {
      c[i] = a[i] + b[i];             |       c[i] = a[i] + b[i];
    }                                 |     }
  return c;                           |
}                                     | }
  • The loop directive parallelizes the for loop and also accepts further OpenACC clauses, here copyin and copyout.
  • The above example needs two vectors to be copied to the GPU, and one vector to be copied back to the CPU.
  • copyin creates the memory on the GPU and transfers the data from the CPU to the GPU.
  • copyout creates the memory on the GPU and transfers the data from the GPU back to the CPU at the end of the region (a sketch combining the remaining data clauses follows below).
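
The other data clauses listed earlier (create, present, copy) become useful when several loops share data on the GPU. Here is a minimal sketch combining them with a data region (the function scaled_sum and the temporary array tmp are hypothetical, not part of the exercises):

#include <stdlib.h>

void scaled_sum(float *a, float *b, float *c, int n)
{
  float *tmp = malloc(n * sizeof(float));   /* host allocation; the GPU copy comes from create() */

  #pragma acc data copyin(a[0:n], b[0:n]) create(tmp[0:n]) copyout(c[0:n])
  {
    /* tmp is computed and consumed entirely on the GPU; it is never transferred */
    #pragma acc kernels loop
    for (int i = 0; i < n; i++)
      tmp[i] = a[i] + b[i];

    /* present() asserts the arrays are already on the GPU, so no extra transfer happens */
    #pragma acc kernels loop present(tmp[0:n], c[0:n])
    for (int i = 0; i < n; i++)
      c[i] = 2.0f * tmp[i];
  }

  free(tmp);
}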

for FORTRAN


!! Vector_Addition.f90                       | !! Vector_Addition_OpenACC.f90
module Vector_Addition_Mod                   | module Vector_Addition_Mod
  implicit none                              |   implicit none
contains                                     | contains
  subroutine Vector_Addition(a, b, c, n)     |   subroutine Vector_Addition(a, b, c, n)
    !! Input vectors                         |     !! Input vectors
    real(8), intent(in), dimension(:) :: a   |     real(8), intent(in), dimension(:) :: a
    real(8), intent(in), dimension(:) :: b   |     real(8), intent(in), dimension(:) :: b
    real(8), intent(out), dimension(:) :: c  |     real(8), intent(out), dimension(:) :: c
    integer :: i, n                          |     integer :: i, n
                                             |     !$acc kernels loop copyin(a(1:n), b(1:n)) &
                                             |     !$acc copyout(c(1:n))
    do i = 1, n                              |     do i = 1, n
       c(i) = a(i) + b(i)                    |        c(i) = a(i) + b(i)
    end do                                   |     end do
                                             |     !$acc end kernels
  end subroutine Vector_Addition             |   end subroutine Vector_Addition
end module Vector_Addition_Mod               | end module Vector_Addition_Mod
  • Now compile and run the above code as we did previously.
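
For example, with the same flags as before (assuming the exercise file also contains a program unit that calls the module):

$> nvfortran -fast -Minfo=all -acc=gpu -gpu=cc70 Vector_Addition_OpenACC.f90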

reduction clause in dot product

for C/C++

// Matrix_Multiplication.c                     | // Matrix_Multiplication_OpenACC.c
for(int row = 0; row < width ; ++row)          | #pragma acc kernels loop collapse(2) \
  {                                            |                        reduction(+:sum)
    for(int col = 0; col < width ; ++col)      | for(int row = 0; row < width ; ++row)
      {                                        |   {
        sum=0;                                 |     for(int col = 0; col < width ; ++col)
        for(int i = 0; i < width ; ++i)        |       {
          {                                    |         sum=0;
            sum += a[row*width+i]              |         for(int i = 0; i < width ; ++i)
                 * b[i*width+col];             |           {
          }                                    |             sum += a[row*width+i]
        c[row*width+col] = sum;                |                  * b[i*width+col];
      }                                        |           }
  }                                            |         c[row*width+col] = sum;
                                               |       }
                                               |   }
The reduction clause can also be applied to the vector addition example from the previous section, for instance to accumulate the total sum of the result vector while it is computed:
// Vector_Addition.c                  | // Vector_Addition_OpenACC.c
float * Vector_Addition               | float * Vector_Addition
(float *restrict a, float *restrict b,| (float *restrict a, float *restrict b, 
float *restrict c, int n)             | float *restrict c, int n)
{                                     | { float sum=0;
                                      | #pragma acc kernels loop \
                                      | reduction(+:sum) copyin(a[0:n], b[0:n]) copyout(c[0:n])
  for(int i = 0; i < n; i ++)         |   for(int i = 0; i < n; i ++)
    {                                 |     {
      c[i] = a[i] + b[i];             |       c[i] = a[i] + b[i];
    }                                 |       sum+=c[i];
  return c;                           |     }
}                                     | }

for FORTRAN

!! Matrix_Multiplication.f90                 |!! Matrix_Multiplication_OpenACC.f90
    do row = 0, width-1                      |   !$acc kernels loop collapse(2) reduction(+:sum)
       do col = 0, width-1                   |   do row = 0, width-1
          sum=0                              |      do col = 0, width-1                           
          do i = 0, width-1                  |         sum=0                           
             sum = sum + (a((row*width)+i+1) |         do i = 0, width-1
                       * b((i*width)+col+1)) |            sum = sum + (a((row*width)+i+1)  
          enddo                              |                      * b((i*width)+col+1))                        
          c(row*width+col+1) = sum           |         enddo                        
       enddo                                 |         c(row*width+col+1) = sum
    enddo                                    |      enddo
                                             |   enddo 
                                             |   !$acc end kernels
  • The reduction clause is needed when we want to sum an array, or do any other accumulation, inside the parallel region; it improves performance and avoids race conditions (and thus wrong results) in the accumulated total.
  • The above examples show how to use it in the C/C++ and FORTRAN languages (a dot-product sketch follows below).
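
Since this section is about the dot product, here is a minimal C sketch of that case (the function name and signature are assumptions, not taken from the exercise files):

float Dot_Product(float *restrict a, float *restrict b, int n)
{
  float sum = 0.0f;

  /* Each GPU thread accumulates a private partial sum; the reduction clause
     combines the partial sums into the final value without race conditions */
  #pragma acc parallel loop reduction(+:sum) copyin(a[0:n], b[0:n])
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];

  return sum;
}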

Practical session:

  • Try the simple Hello_World_OpenACC.c and Hello_World_OpenACC.f90 with the OpenACC parallel constructs, and try to understand what the compiler is reporting.
  • Do the simple Vector_Addition_OpenACC.c and Vector_Addition_OpenACC.f90 with the OpenACC parallel constructs and use data clauses for data management.
  • Similarly, try Vector_Addition_OpenACC.c and Vector_Addition_OpenACC.f90 with the OpenACC parallel constructs and data clauses, and add the gang, worker and vector clauses to control how threads are created.
  • Finally, do Matrix_Multiplication_OpenACC.c and Matrix_Multiplication_OpenACC.f90 and use the reduction clause along with the parallel constructs and data management clauses.
  • Similarly, add the gang, worker and vector (thread block) clauses to Matrix_Multiplication_OpenACC.c and Matrix_Multiplication_OpenACC.f90.

To follow up on CUDA and OpenACC programming, please visit the MOOC course: GPU Programming for Scientific Computing and Beyond!