Image Convolution with GPU and CUDA

 Copyright (c) 2020-2021 L. Koutsantonis UL HPC Team <hpc-team@uni.lu>

This tutorial will cover the following aspects of CUDA programming:

GPU Global Memory Allocation
Dynamic Shared Memory Allocation
Thread Indexing
Thread Synchronization

Pre-requisites

Ensure you are able to connect to the UL HPC clusters. In particular, recall that the module command is not available on the access frontends.

### Access to ULHPC cluster - here iris
(laptop)$> ssh iris-cluster
# /!\ Advanced (but recommended) best-practice:
#    always work within an GNU Screen session named with 'screen -S <topic>' (Adapt accordingly)
# IIF not yet done, copy ULHPC .screenrc in your home
(access)$> cp /etc/dotfiles.d/screen/.screenrc ~/

Now you'll need to pull the latest changes in your working copy of the ULHPC/tutorials you should have cloned in ~/git/github.com/ULHPC/tutorials (see "preliminaries" tutorial)

(access)$> cd ~/git/github.com/ULHPC/tutorials
(access)$> git pull

You should have followed the Introduction to GPU programming with CUDA

Laplacian of Gaussian (LoG): A convolution kernel for edge detection

Derivative Filter used to find rapid changes in signals and especially images
Used for edge detection and noise detection
Mathematical Formula:

$H(x,y) = \frac{-1}{\pi \sigma ^4}(1-\frac{x^2+y^2}{2\sigma ^2})e^{-\frac{x^2+y^2}{2\sigma^2}}$

Original Image

Convoluted Image

Convolution Operator

The discrete convolution operator is define by the double sum:

$G_{m,n} = F *H = \sum_i \sum_j {F_{m-i, n-j} H_{i,j} }$

where $F$ is the original image, $H$ is the convolution kernel and $G$ is the resulted image. Convoluted Image

CPU Implementation

A serial code implementing the image convolution on a CPU employs two loops to compute the values of the pixels of the output image. The convolution operator is calculated at each iteration for each image pixel using the double sum provided in the equation above.

//CPU function: conv_img_cpu
//Parameters: float *img, float *kernel, float *imgf, int Nx, int Ny, int kernel_size
//center: center of kernel
  for (int i = center; i<(Ny-center); i++)
    for (int j = center; j<(Nx-center); j++){
        //Convolution Operator:
        sum = 0;
        for (int ki = 0; ki<kernel_size; ki++)
           for (int kj = 0; kj<kernel_size; kj++){
               ii = j + kj - center;
               jj = i + ki - center;
               sum+=img[jj*Nx+ii]*kernel[ki*kernel_size + kj];
               }
        imgf[i*Nx +j] = sum;
        }

GPU Implementation

The parallel implementation of convolution of GPU is described by the following figure. Multiple threads are used to calculate the convolution operator of multiple pixels simultaneously. The total number of calculated pixels at each step will be equal to the total number of launched threads ( $Number of Blocks \times Block Threads$ ). Each thread having access to the coefficients of the convolution kernel calculates the double sum of the convolution operator. The kernel coefficients being constant during the whole execution can be stored into the shared memory (accessible by the block threads).

Convoluted Image

Hands On Image Convolution with CUDA

Get the source files

Task 1: If you do not have yet the UL HPC tutorial repository, clone it. Update to the latest version.

ssh iris-cluster
mkdir -p ~/git/github.com/ULHPC
cd  ~/git/github.com/ULHPC
git clone https://github.com/ULHPC/tutorials.git
cd tutorials/cuda/exercises/convolution
git stash && git pull -r && git stash pop

Get an interactive GPU job

A CUDA kernel for the Convolution Operator

Task 2: Following the steps 1 to 3 provided bellow write a CUDA kernel for the computation of the convolution operator.

Open the source file LoG_gpu_exercise.cu with your favorite editor (e.g. emacs LoG_gpu_exercise.cu). The CUDA kernel is already defined:

void conv_img_cpu(float *img, float *kernel, float *imgf, int Nx, int Ny, int kernel_size)

where *img is a pointer to the original image vector, *kernel is a pointer to the convolution kernel vector, *imgf is a pointer to the convoluted image, Nx and Ny are the dimensions of both the original and convoluted image, and kernel_size is the dimension of the convolution kernel.

Step 1. CUDA Threads and Blocks indices

The CUDA kernel will be executed by each thread. Thus, a mapping mechanism is needed for each thread to compute a specific pixel of the output image and store the result to the corresponding memory location.

CUDA kernels have access to device variables identifying both the thread index within the block and the block index. These variables are threadIdx.x and blockIdx.x respectively. In this example, each block of threads will compute a row of the output image and each block thread will compute a single pixel value on this row. Thus, the index of a pixel in the image can be defined throught:

//each block is assigned to a row of an image, iy integer index of y
  int iy = blockIdx.x + (kernel_size - 1)/2;

  //each thread is assigned to a pixel of a row, ix integer index of x
  int ix = threadIdx.x + (kernel_size - 1)/2;

The offset (kernel_size - 1)/2 is added to the iy, ix variables as the convolution will not be computed for the image pixels lying at the boundary layers of the original image (computations are performed only when the discrete filter kernel lies completely within the original image). We can define the center of the convolution kernel as it will be used in differnet calculations:

//center of kernel in both dimensions
  int center = (kernel_size -1)/2;

It is important to say that the kernel_size must be an odd number so that its center has an integer value.

For each block thread, the memory location of the corresponding pixel can be calculated by:

int idx = iy*Nx +ix;

Step 2. Dynamically Allocated Shared Memory

Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. The best practice is to use the shared memory for parameters that remain constant during the execution of the CUDA kernel and used in multiple calculations. In our example, these parameters are the coefficient of the convolution kernel. To statically allocate shared memory we use:

__shared__ float sdata[9];

where in this case the kernel_size is 9. As we want our kernel to run with convolution kernels of different size, we can dynamically allocate shared memory using the prefix extern:

extern __shared__ float sdata[];

where in this case, the size of the allocated shared memory is passed through the launch of the kernel (from the main function).

For each block, the block threads are used to copy the coefficients of the convolution kernel from the global memory to the shared memory. The index of each thread is used to define the location in shared memory where the convolution coefficients from the global memory will be copied. It also provides the location where these coefficients are stored in the global memory. For our case, these mappings are identical:

int tid = threadIdx.x;
int K2 = kernel_size*kernel_size;
extern __shared__ float sdata[];
if (tid<K2)
    sdata[tid] = kernel[tid];
 __syncthreads();

As the vector containing the convolution coefficients is of size of K2 = kernel_size*kernel_size, only the threads with index in the range[0 ,K2) can access the global and shared memory. This is ensured using the if (tid<K2) statement. At the end of the transfer we have to syncronize all block threads using ___syncthreads(). This synchronization is required to ensure that all of the block threads will perform calculations after all convolution coefficients are completely transfered to the shared memory.

Step 3. Calculate the output image

Each thread calculates the corresponding pixel value of the convoluted image. The convolution operation is performed through a nested loop implementing a double summation. At each iteration, each block thread calculates the multiplication of a pixel value of the original image lying within the convolution window with the corresponding coefficient of the convolution kernel stored in shared memory (see image above). The result is used to update the value of a local summation (sum). The value of the output pixel is the result of the double summation. The CUDA code implementing the above procedure is given bellow:

if (idx<Nx*Ny){
    for (int ki = 0; ki<kernel_size; ki++)
        for (int kj = 0; kj<kernel_size; kj++){
           ii = kj + ix - center;
           jj = ki + iy - center;
    sum+=img[jj*Nx+ii]*sdata[ki*kernel_size + kj];
    }
    imgf[idx] = sum;
  }

As you can see the code is exactly the same with the code implementing the convolution on a CPU. What is missing is the double for loop running on the pixels of the output image.

In the above code, the if (idx<Nx*Ny) statement is used to ensure that each thread has access to an allocated memory location.

Complete the main function

In main, the arrays img, imgf and kernel storing the original images pixel values, the convoluted image pixel values and the convolution kernel coefficients respectively are float arrays allocated in the host memory (CPU). The original image is preloaded into the img through the host function:

void load_image(char *fname, int Nx, int Ny, float  *img)

The convolution kernel coefficients are calculated for a given sigma value sigma and convolution kernel size kernel_size through the host function:

void calculate_kernel(int kernel_size, float sigma, float *kernel)

Allocate Device Memory

The array d_img storing the original image pixel values in device global memory is allocated using:

cudaMalloc(&d_img,Nx*Ny*sizeof(float));

Task 3a Based on the above instruction allocate device memory for the arrays d_imgf and d_kernel storing the output image pixels and convolution kernel coeffients respectively. The size of the output image in Nx \times Ny and the size of the convolution kernel is kernel_size \times kernel_size.

Transfer the filter coefficients from the host memory to the device memory

The original image pixel values are copied from the host memory to the device memory through:

cudaMemcpy(d_img, img, Nx*Ny*sizeof(float),cudaMemcpyHostToDevice);

Task 3b Utilize the above instruction to tranfer the convolution kernel coefficients from the host memory kernel to the global memory d_kernel.

Configure the Execution and Launch the CUDA kernel

The kernel code will by executed by every thread after its launch. The total number of blocks Nblocks will be equal to the number of the calculated rows of the output image (Ny - kernel_size + 1) and the number of block threads Nthreads equal to the number of the calculated pixel elements per row (Nx - kernel size + 1). In general, a CUDA kernel is configured and launched using the following instruction:

my_gpu_kernel<<<Nblocks, Nthreads>>>(vars);

When shared memory is dynamically allocated within the kernel, an extra argument size defining the number of bytes of the shared memory that is dynamically allocated per thread block has to be passed through the configuration of the kernel:

conv_img_gpu<<<Nblocks, Nthreads, size>>>(d_img, d_kernel, d_imgf, Nx, Ny, kernel_size);

In our case, the size of the allocated shared memory is equal to the size of the convolution kernel Size = kernel_size*kernel_size*sizeof(float).

Task 4 Configure the Execution and Launch the GPU kernel from the main function.

De-allocation of host and device memory

At the end of main, the host memory is deallocated with free(var). Similarly, the device memory is deallocated with cudaFree(var).

  free(img);
  free(imgf);
  free(kernel);

  cudaFree(d_img);
  cudaFree(d_imgf);
  cudaFree(d_kernel);

Compile and run your code

The NVIDIA CUDA compiler 'nvcc' is used to compile the source code containing both the host and device functions. The non CUDA part of the code will be forwarded to a general purpose host compiler (e.g. gcc). As you have seen, the GPU functionsare declared using some annotations (e.g. __global__, __device__) distinguishing them from the host code.

In simple words, you can compile your code using:

nvcc -arch=compute_70 -o ./$exe $src

where nvcc is the keyword for the nvcc compiler, $src is the name of the source file to compile ( e.g.LoG_gpu.cu is passed as the source file to compile, the o flag is used to specify the name $exe of the compiled program, and the arch points to the GPU architecture for which the source file must be compiled. sm_70 indicates the Volta GPU architecture. Use -lm to link the math library to your executable.

To run your executable file interactively, just use:

./$exe $arg1 $arg2 ... $argn

where $arg1, $arg2, ..., $argn are the appropriate arguments (if any).

Additional Exercises

Profiling

You can use nvprof (documentation) to profile your GPU application.

To profile your application simply:

nvprof ./$exe  # you can also add --log-file prof

Experimentation with convolution parameters

Try to change the parameters sigma and kernel_size in your main function. Try to use a large enough convolution kernel size. Compile the modified source code and use nvprof to profile your application. Do you observe any difference in the execution time of the GPU kernel?

Try to implement the GPU kernel without using the shared memory. In this case the CUDA kernel is implemented as follows:


if (idx<Nx*Ny){
    for (int ki = 0; ki<kernel_size; ki++)
        for (int kj = 0; kj<kernel_size; kj++){
           ii = kj + ix - center;
           jj = ki + iy - center;
    sum+=img[jj*Nx+ii]*kernel[ki*kernel_size + kj];
    }
    imgf[idx] = sum;
  }

Again, recompile your source file and profile your application. Can you observe any difference in the execution time of the GPU kernel?

Plotting your convoluted image

A jupyter notebook show_images.ipynb is available for plotting your results. You can access the notebook directly from the iris-cluster by building a virtual environment for the jupyter notebook.

First connect to the iris-cluster with a SOCK proxy opened on port 1080:

ssh -D 1080 iris-cluster

Reserve a single node interactively:

si-gpu

Prepare virtual environment for the notebook:

module load lang/Python/3.7.2-GCCcore-8.2.0
python -m venv ~/venv/jupyter
source ~/venv/jupyter/bin/activate
pip install -U pip
pip install jupyter
python -m ipykernel install --user --name=jupiter
pip install matplotlib
pip install numpy

Generate your jupyter notebook:

jupyter notebook --generate-config

You can protect your notebook with a password (not required):

jupyter notebook password

Launch your notebook:

jupyter notebook --ip $(ip addr | grep '172.17' | grep 'inet ' | awk '{print $2}' | cut -d/ -f1) --no-browser

The IP address of the node launcing the notebook is shown:

[]> Jupyter Notebook 6.1.5 is running at:
[]> http://172.17.6.196:8888/

On your local machine, you can access the notebook using an SSH tunnel. Open a terminal and type:

ssh -NL 8000:$IP iris-cluster

where $IP is the IP address of the node where the jupyter notebook was launched.

Open a browser and access your jupyter notebook on your localhost: http://127.0.0.1:8000. Run the python script to reproduce the results of your CUDA application.