# Image Convolution with GPU and CUDA

Copyright (c) 2020-2021 L. Koutsantonis UL HPC Team <hpc-team@uni.lu>

This tutorial will cover the following aspects of CUDA programming:

1. GPU Global Memory Allocation
2. Dynamic Shared Memory Allocation

## Pre-requisites

Ensure you are able to connect to the UL HPC clusters. In particular, recall that the `module` command is not available on the access frontends.

### Access to ULHPC cluster - here iris
```bash
(laptop)$> ssh iris-cluster
# /!\ Advanced (but recommended) best-practice:
#     always work within a GNU Screen session named with 'screen -S <topic>' (adapt accordingly)
# If not yet done, copy the ULHPC .screenrc in your home:
(access)$> cp /etc/dotfiles.d/screen/.screenrc ~/
```

Now you will need to pull the latest changes into your working copy of the ULHPC/tutorials repository, which you should have cloned in `~/git/github.com/ULHPC/tutorials` (see the "preliminaries" tutorial):

```bash
(access)$> cd ~/git/github.com/ULHPC/tutorials
(access)$> git pull
```

You should also have followed the Introduction to GPU programming with CUDA tutorial first.

## Laplacian of Gaussian (LoG): A convolution kernel for edge detection

- Derivative filter used to find rapid changes in signals, and especially in images
- Used for edge detection and noise detection
- Mathematical formula:

$$H(x,y) = \frac{-1}{\pi \sigma^4}\left(1-\frac{x^2+y^2}{2\sigma^2}\right)e^{-\frac{x^2+y^2}{2\sigma^2}}$$

## Convolution Operator

The discrete convolution operator is defined by the double sum:

$$G_{m,n} = (F * H)_{m,n} = \sum_i \sum_j F_{m-i,\,n-j}\, H_{i,j}$$

where $F$ is the original image, $H$ is the convolution kernel, and $G$ is the resulting image.

## CPU Implementation

A serial code implementing the image convolution on a CPU uses two nested loops to iterate over the pixels of the output image. At each iteration, the convolution operator is evaluated for one pixel using the double sum given in the equation above (two further nested loops over the kernel coefficients).

```c
// CPU function: conv_img_cpu
// img:  input image (Ny x Nx), kernel: convolution kernel,
// imgf: output image, kernel_size: width of the (square) kernel
void conv_img_cpu(float *img, float *kernel, float *imgf,
                  int Nx, int Ny, int kernel_size)
{
  int center = kernel_size / 2;  // center of the kernel
  for (int i = center; i < (Ny - center); i++)
    for (int j = center; j < (Nx - center); j++) {
      // Convolution operator:
      float sum = 0.0f;
      for (int ki = 0; ki < kernel_size; ki++)
        for (int kj = 0; kj < kernel_size; kj++) {
          int ii = j + kj - center;
          int jj = i + ki - center;
          sum += img[jj * Nx + ii] * kernel[ki * kernel_size + kj];
        }
      imgf[i * Nx + j] = sum;
    }
}
```

## GPU Implementation

The parallel implementation of the convolution on a GPU is described by the following figure. Multiple threads are used to calculate the convolution operator for multiple pixels simultaneously, so the number of pixels processed at each step equals the total number of launched threads ($\text{number of blocks} \times \text{threads per block}$). Each thread, having access to the coefficients of the convolution kernel, calculates the double sum of the convolution operator. Since the kernel coefficients are constant during the whole execution, they can be stored in shared memory, which is accessible by all threads of a block.
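A minimal sketch of such a kernel is shown below, assuming one thread per output pixel with 1D indexing and the kernel coefficients staged in dynamically allocated shared memory. The kernel name `conv_img_gpu` and the exact indexing scheme are assumptions here; the provided exercise sources may differ.

```cuda
// Sketch of a GPU convolution kernel: one thread per output pixel.
__global__ void conv_img_gpu(float *img, float *kernel, float *imgf,
                             int Nx, int Ny, int kernel_size)
{
  // Dynamic shared memory: its size is given at launch time as the
  // third argument of the <<<...>>> execution configuration.
  extern __shared__ float sKernel[];

  int tid = threadIdx.x;
  int idx = blockIdx.x * blockDim.x + tid;  // global pixel index
  int center = kernel_size / 2;

  // Cooperatively stage the kernel coefficients in shared memory
  for (int k = tid; k < kernel_size * kernel_size; k += blockDim.x)
    sKernel[k] = kernel[k];
  __syncthreads();

  int i = idx / Nx;  // row of the output pixel
  int j = idx % Nx;  // column of the output pixel

  if (i >= center && i < Ny - center &&
      j >= center && j < Nx - center) {
    float sum = 0.0f;
    for (int ki = 0; ki < kernel_size; ki++)
      for (int kj = 0; kj < kernel_size; kj++) {
        int ii = j + kj - center;
        int jj = i + ki - center;
        sum += img[jj * Nx + ii] * sKernel[ki * kernel_size + kj];
      }
    imgf[i * Nx + j] = sum;
  }
}
```

A matching launch would pass the shared-memory size explicitly, e.g. `conv_img_gpu<<<nblocks, nthreads, kernel_size * kernel_size * sizeof(float)>>>(d_img, d_kernel, d_imgf, Nx, Ny, kernel_size);`, after the input image and kernel have been copied to GPU global memory with `cudaMalloc`/`cudaMemcpy`.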

## Hands On Image Convolution with CUDA

### Get the source files

Task 1: If you do not yet have the UL HPC tutorial repository, clone it; otherwise, update it to the latest version:

```bash
ssh iris-cluster
mkdir -p ~/git/github.com/ULHPC
cd ~/git/github.com/ULHPC
git clone https://github.com/ULHPC/tutorials.git
cd tutorials/cuda/exercises/convolution
git stash && git pull -r && git stash pop
```

### Get an interactive GPU job

### ... either directly - dedicate 1/4 of available cores to the management of the GPU card