# Image Convolution with GPU and CUDA

Copyright (c) 2020-2021 L. Koutsantonis, UL HPC Team <hpc-team@uni.lu>

This tutorial will cover the following aspects of CUDA programming:

1. GPU Global Memory Allocation
2. Dynamic Shared Memory Allocation
3. Thread Indexing
4. Thread Synchronization

## Pre-requisites

Ensure you are able to connect to the UL HPC clusters. In particular, recall that the `module` command is not available on the access frontends.

### Access to ULHPC cluster - here iris
```bash
(laptop)$> ssh iris-cluster
# /!\ Advanced (but recommended) best-practice:
#     always work within a GNU Screen session named with 'screen -S <topic>' (Adapt accordingly)
# IIF not yet done, copy ULHPC .screenrc in your home
(access)$> cp /etc/dotfiles.d/screen/.screenrc ~/
```


Now you'll need to pull the latest changes into your working copy of the ULHPC/tutorials repository, which you should have cloned in `~/git/github.com/ULHPC/tutorials` (see the "preliminaries" tutorial):

```bash
(access)$> cd ~/git/github.com/ULHPC/tutorials
(access)$> git pull
```


You should have followed the Introduction to GPU programming with CUDA tutorial.

## Laplacian of Gaussian (LoG): A convolution kernel for edge detection

- Derivative filter used to find rapid changes in signals and especially in images
- Used for edge detection and noise detection
- Mathematical formula:

$$H(x,y) = \frac{-1}{\pi \sigma ^4}\left(1-\frac{x^2+y^2}{2\sigma ^2}\right)e^{-\frac{x^2+y^2}{2\sigma^2}}$$

## Convolution Operator

The discrete convolution operator is defined by the double sum:

$$G_{m,n} = F * H = \sum_i \sum_j {F_{m-i, n-j} H_{i,j}}$$

where $F$ is the original image, $H$ is the convolution kernel, and $G$ is the resulting image.

## CPU Implementation

A serial code implementing the image convolution on a CPU employs two nested loops to compute the values of the pixels of the output image. At each iteration, the convolution operator is calculated for one image pixel using the double sum given in the equation above.

```c
// CPU function: conv_img_cpu
// Parameters: float *img, float *kernel, float *imgf, int Nx, int Ny, int kernel_size
// center: center of the kernel (center = kernel_size / 2)
for (int i = center; i < (Ny - center); i++)
  for (int j = center; j < (Nx - center); j++) {
    // Convolution operator:
    sum = 0;
    for (int ki = 0; ki < kernel_size; ki++)
      for (int kj = 0; kj < kernel_size; kj++) {
        ii = j + kj - center;   // column index in the input image
        jj = i + ki - center;   // row index in the input image
        sum += img[jj*Nx + ii] * kernel[ki*kernel_size + kj];
      }
    imgf[i*Nx + j] = sum;
  }
```


## GPU Implementation

The parallel implementation of convolution on a GPU is described by the following figure. Multiple threads are used to calculate the convolution operator for multiple pixels simultaneously. The number of pixels processed at each step equals the total number of launched threads ($\text{Number of Blocks} \times \text{Threads per Block}$). Each thread, having access to the coefficients of the convolution kernel, calculates the double sum of the convolution operator. Since the kernel coefficients remain constant during the whole execution, they can be stored in shared memory, which is accessible by all threads of a block.

## Hands On Image Convolution with CUDA
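Before working through the exercises, the shared-memory kernel described in the previous section can be sketched as follows (a sketch only, with hypothetical names such as `conv_img_gpu` and `sdata`; the tutorial's source files contain the full version):

```cuda
// Each thread computes one output pixel. The kernel coefficients are
// first staged into dynamically allocated shared memory by the block's
// threads, which then synchronize before using them.
__global__ void conv_img_gpu(float *img, float *kernel, float *imgf,
                             int Nx, int Ny, int kernel_size)
{
    extern __shared__ float sdata[];          // dynamic shared memory

    int tid    = threadIdx.x;
    int idx    = blockIdx.x * blockDim.x + tid;  // global pixel index
    int center = kernel_size / 2;
    int ix     = idx % Nx;                    // pixel column
    int iy     = idx / Nx;                    // pixel row

    // Stage the kernel coefficients into shared memory
    for (int k = tid; k < kernel_size * kernel_size; k += blockDim.x)
        sdata[k] = kernel[k];
    __syncthreads();                          // wait until staging completes

    if (ix >= center && ix < Nx - center &&
        iy >= center && iy < Ny - center) {
        float sum = 0.0f;
        for (int ki = 0; ki < kernel_size; ki++)
            for (int kj = 0; kj < kernel_size; kj++) {
                int ii = kj + ix - center;
                int jj = ki + iy - center;
                sum += img[jj * Nx + ii] * sdata[ki * kernel_size + kj];
            }
        imgf[idx] = sum;
    }
}
```

The dynamic shared-memory size is passed as the third launch parameter, e.g. `conv_img_gpu<<<grid, block, kernel_size*kernel_size*sizeof(float)>>>(...)`.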

### Get the source files

Task 1: If you do not have the UL HPC tutorial repository yet, clone it, then update it to the latest version.

```bash
ssh iris-cluster
mkdir -p ~/git/github.com/ULHPC
cd ~/git/github.com/ULHPC
git clone https://github.com/ULHPC/tutorials.git
cd tutorials/cuda/exercises/convolution
git stash && git pull -r && git stash pop
```


### Get an interactive GPU job

### ... either directly - dedicate 1/4 of the available cores to the management of the GPU card
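The commands for this step are missing from this copy of the tutorial. On iris, an interactive GPU job can typically be obtained with the `si-gpu` helper used later in this tutorial, or explicitly with Slurm (the exact partition, core count, and time limit below are illustrative and may need adapting):

```shell
# Interactive GPU job on iris via the ULHPC helper script
si-gpu
# ... or explicitly with srun, e.g. 1 GPU and a quarter of the node's cores:
# srun -p gpu --gres=gpu:1 -c 7 --pty bash -i
```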

## Additional Exercises

### Profiling

You can use `nvprof` (see its documentation) to profile your GPU application.

To profile your application simply:

```bash
nvprof ./$exe
# you can also add --log-file prof
```

### Experimentation with convolution parameters

Try to change the parameters `sigma` and `kernel_size` in your `main` function. Try to use a large enough convolution kernel size. Compile the modified source code and use `nvprof` to profile your application. Do you observe any difference in the execution time of the GPU kernel?

Try to implement the GPU kernel without using the shared memory. In this case, the CUDA kernel is implemented as follows:

```c
if (idx < Nx*Ny){
  for (int ki = 0; ki < kernel_size; ki++)
    for (int kj = 0; kj < kernel_size; kj++){
      ii = kj + ix - center;
      jj = ki + iy - center;
      sum += img[jj*Nx + ii] * kernel[ki*kernel_size + kj];
    }
  imgf[idx] = sum;
}
```

Again, recompile your source file and profile your application. Can you observe any difference in the execution time of the GPU kernel?

### Plotting your convoluted image

A jupyter notebook `show_images.ipynb` is available for plotting your results. You can access the notebook directly from the iris cluster by building a virtual environment for the jupyter notebook.

First connect to the iris cluster with a SOCKS proxy opened on port 1080:

```bash
ssh -D 1080 iris-cluster
```

Reserve a single node interactively:

```bash
si-gpu
```

Prepare the virtual environment for the notebook:

```bash
module load lang/Python/3.7.2-GCCcore-8.2.0
python -m venv ~/venv/jupyter
source ~/venv/jupyter/bin/activate
pip install -U pip
pip install jupyter
python -m ipykernel install --user --name=jupiter
pip install matplotlib
pip install numpy
```

Generate your jupyter notebook configuration:

```bash
jupyter notebook --generate-config
```

You can protect your notebook with a password (not required):

```bash
jupyter notebook password
```

Launch your notebook:

```bash
jupyter notebook --ip $(ip addr | grep '172.17' | grep 'inet ' | awk '{print $2}' | cut -d/ -f1) --no-browser
```

The IP address of the node launching the notebook is shown:

```
[]> Jupyter Notebook 6.1.5 is running at:
[]> http://172.17.6.196:8888/
```

On your local machine, you can access the notebook using an SSH tunnel. Open a terminal and type:

```bash
ssh -NL 8000:$IP iris-cluster
```


where `$IP` is the IP address of the node where the jupyter notebook was launched.

Open a browser and access your jupyter notebook on your localhost at http://127.0.0.1:8000. Run the python script to reproduce the results of your CUDA application.
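Independently of the notebook, you can sanity-check your GPU output by recomputing the convolution on the host. A minimal NumPy sketch of the same valid-region convolution used in the C code (the function name `conv_cpu` is illustrative):

```python
import numpy as np

def conv_cpu(img, kernel):
    """Reference convolution matching the CUDA loop: for each interior
    pixel, sum the element-wise product of the kernel and the window
    of the image centered on that pixel (border pixels are left at 0)."""
    Ny, Nx = img.shape
    ks = kernel.shape[0]
    c = ks // 2
    out = np.zeros_like(img)
    for i in range(c, Ny - c):
        for j in range(c, Nx - c):
            out[i, j] = np.sum(img[i - c:i + c + 1, j - c:j + c + 1] * kernel)
    return out
```

Comparing `conv_cpu(img, kernel)` against the image produced by the CUDA kernel (e.g. with `np.allclose`) is a quick way to verify the GPU result pixel by pixel.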