Convolution

This guide is meant to get you started with writing your tests and tuning scripts using Kernel Tuner. We'll use a simple 2D Convolution kernel as an example, but as you will find out shortly, much of the code that you write with Kernel Tuner can be reused for testing and tuning other kernels.

Note: If you are reading this guide on Kernel Tuner's documentation pages, you can also run this guide as a Jupyter Notebook. Just clone Kernel Tuner's GitHub repository, install it with pip install .[tutorial,cuda], and you're ready to go! You can start the guide by typing "jupyter notebook" in the "kernel_tuner/doc/source" directory.

2D Convolution example

Convolution operations are essential to signal and image processing applications and are the main operation in convolutional neural networks used for deep learning. A convolution operation computes the linear combination of the weights in a convolution filter and a range of pixels from the input image for each output pixel. A 2D convolution of an input image \(I\) of size \((w\times h)\) and a convolution filter \(F\) of size \((F_w\times F_h)\) computes an output image \(O\) of size \(((w-F_w+1)\times (h-F_h+1))\): \begin{equation}\nonumber O(x,y) = \sum\limits_{j=0}^{F_h-1} \sum\limits_{i=0}^{F_w-1} I(x+i,y+j)\times F(i,j) \end{equation}
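To make the formula concrete, here is a minimal NumPy sketch of the same operation, written as a direct translation of the summation above (for clarity, not for speed):

[ ]:
import numpy as np

def convolve_reference(image, filt):
    # direct translation of the summation: every output pixel is the
    # weighted sum of a filter-sized window of the input image
    filter_height, filter_width = filt.shape
    out_height = image.shape[0] - filter_height + 1
    out_width = image.shape[1] - filter_width + 1
    output = np.zeros((out_height, out_width), dtype=image.dtype)
    for y in range(out_height):
        for x in range(out_width):
            output[y, x] = np.sum(image[y:y+filter_height, x:x+filter_width] * filt)
    return output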

A naive CUDA kernel for 2D Convolution parallelizes the operation by creating one thread for each output pixel. Note that to avoid confusion around the term kernel, we refer to the convolution filter as a filter.

The kernel code is shown in the following code block. Make sure you execute all code blocks in this guide by selecting them and pressing Shift+Enter:

[ ]:
%%writefile convolution_naive.cu

__global__ void convolution_kernel(float *output, float *input, float *filter) {

    // map each thread to one output pixel
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int i, j;
    float sum = 0.0f;

    if (y < image_height && x < image_width) {

        // accumulate the weighted sum over the filter window
        for (j = 0; j < filter_height; j++) {
            for (i = 0; i < filter_width; i++) {
                sum += input[(y + j) * input_width + (x + i)] * filter[j * filter_width + i];
            }
        }

        output[y * image_width + x] = sum;
    }
}

Implement a test

We will start by using Kernel Tuner's run_kernel function to call our naive 2D convolution kernel. But first we have to create some input data, which we do as follows:

[ ]:
import numpy as np
from kernel_tuner import run_kernel

filter_size = (17, 17)
output_size = (4096, 4096)

size = np.prod(output_size)
border_size = (filter_size[0]//2*2, filter_size[1]//2*2)
input_size = ((output_size[0]+border_size[0]) * (output_size[1]+border_size[1]))

output_image = np.zeros(size).astype(np.float32)
input_image = np.random.randn(input_size).astype(np.float32)

conv_filter = np.random.randn(filter_size[0]*filter_size[1]).astype(np.float32)

Now that we have our input and output data structures created, we can look at how to run our naive kernel on this data, by calling run_kernel. The run_kernel function has the following signature:

run_kernel(kernel_name, kernel_source, problem_size, arguments, params, ...)

The ellipsis here indicates that there are many more optional arguments, which we won't need right now. If you're interested, the complete API documentation of run_kernel can be found here.

The five required arguments of run_kernel are:

* kernel_name: the name of the kernel as a string
* kernel_source: a string filename, one or more strings with code, or a code generator function
* problem_size: the size of the domain in up to three dimensions
* arguments: a list of arguments used to call the kernel
* params: a dictionary with the tunable parameters

The kernel_name is simply a string with the name of the kernel in the code. The kernel_source can be a string containing the code, or a filename. The first cell in this notebook wrote the kernel code to a file named "convolution_naive.cu".

[ ]:
kernel_name = "convolution_kernel"
kernel_source = "convolution_naive.cu"

The problem_size is what is used by Kernel Tuner to determine the grid dimensions of the kernel. Our naive kernel needs one thread for each pixel in the output image. As defined above, our output_size is \(4096 \times 4096\).

Kernel Tuner computes the grid dimensions of a kernel by dividing the problem_size in each dimension by the grid divisors in that dimension. The grid divisors are, by default, simply the thread block dimensions. So for our naive kernel we do not need to specify any grid divisor lists at this point.
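To illustrate (this snippet is only for clarity and is not part of the tuning script), with \(16 \times 16\) thread blocks the grid dimensions would be computed roughly as follows:

[ ]:
import math

# with the default grid divisors, each grid dimension is the problem size
# divided (rounding up) by the corresponding thread block dimension
grid_x = math.ceil(4096 / 16)  # 256 thread blocks in x
grid_y = math.ceil(4096 / 16)  # 256 thread blocks in y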

[ ]:
problem_size = output_size

arguments is a list of the arguments that are used to run the kernel on the GPU. It should be specified as a list of Numpy objects (arrays and/or scalars) that correspond to the function arguments of our CUDA kernel. Our naive convolution kernel has the following signature:

__global__ void convolution_kernel(float *output, float *input, float *filter) { }

Therefore, our list of Numpy objects should contain the output image, the input image, and the convolution filter, and exactly in that order, matching the type (32-bit floating-point arrays) and dimensions that are expected by the kernel.

[ ]:
arguments = [output_image, input_image, conv_filter]

The final required argument is params, which is a dictionary with the user-defined parameters of the kernel. Remember that the user is you! You can specify anything here, and Kernel Tuner will insert a C-preprocessor #define statement into the kernel with the value that you specify. For example, if you were to create a dictionary like so:

params = {"I_like_convolutions": 42}

Kernel Tuner will insert the following line into our naive convolution kernel:

#define I_like_convolutions 42

While we do like convolutions, this definition won’t have much effect on the performance of our kernel. Unless of course somewhere in our kernel we are doing something differently depending on the definition or the value of this preprocessor token.
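Conceptually (this is only an illustration of the behavior, not Kernel Tuner's actual implementation), every entry in params becomes a #define that is prepended to the kernel source before compilation:

# illustrative sketch of what Kernel Tuner effectively does with params
params = {"I_like_convolutions": 42}
defines = "\n".join(f"#define {name} {value}" for name, value in params.items())
kernel_string = defines + "\n" + open("convolution_naive.cu").read()
print(kernel_string.splitlines()[0])  # prints: #define I_like_convolutions 42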

In addition to freely defined parameters, there are a few special values. You may have noticed that we are about to call a CUDA kernel but we haven’t specified any thread block dimensions yet. When using Kernel Tuner, thread block dimensions are basically just parameters to the kernel. Therefore, the parameters with the names "block_size_x", "block_size_y", and "block_size_z" will be interpreted by Kernel Tuner as the thread block dimensions in x,y, and z.

Note that these are just the defaults, if you prefer to name your thread block dimensions differently, please use the block_size_names= option.
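For example, for a kernel whose code uses tx and ty as its thread block dimensions (hypothetical names, not used by the kernel in this guide), you could write something like:

# hypothetical example: a kernel that names its thread block dimensions
# "tx" and "ty" instead of the default "block_size_x"/"block_size_y"
params = {"tx": 16, "ty": 16}
answer = run_kernel("some_kernel", "some_kernel.cu", problem_size, arguments, params,
                    block_size_names=["tx", "ty"])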

Let’s continue with the creation of our params dictionary such that we can run our naive convolution kernel. As thread block dimensions we will just select the trusty old \(16 \times 16\):

[ ]:
params = dict()
params["block_size_x"] = 16
params["block_size_y"] = 16

Finally, we specify some input dimensions that are required by our kernel. As you may have noticed, the kernel uses several constants that are currently undefined: image_height, image_width, filter_height, filter_width, and input_width. We insert those values using the same parameters feature of Kernel Tuner. Note that this is not required; we could also have passed these at runtime as arguments to the kernel.

[ ]:
params["image_height"] = output_size[1]
params["image_width"] = output_size[0]
params["filter_height"] = filter_size[1]
params["filter_width"] = filter_size[0]
params["input_width"] = output_size[0] + border_size[0]

Now we have set up everything that should allow us to call run_kernel:

[ ]:
answer = run_kernel(kernel_name, kernel_source, problem_size, arguments, params)
print("Done")

If you execute the above cell, it will allocate GPU memory, move the contents of the arguments list to GPU memory, compile the kernel specified in kernel_source, and run the kernel named kernel_name with the thread block dimensions specified in params and the grid dimensions derived from problem_size. After executing the kernel, run_kernel retrieves the results from GPU memory and frees the GPU memory. The run_kernel function returns the data retrieved from the GPU as a list of Numpy arrays, which we have named answer in the above example.

The answer list contains Numpy objects (arrays and/or scalars) in the same order and of the same type as the arguments list that we used to call the kernel with, but in contrast to arguments it contains the data that was stored in GPU memory after our naive convolution kernel had finished executing. This feature is particularly useful for implementing tests for your GPU kernels. You can perform the same operation in Python and compare the results.
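To demonstrate, here is a minimal sketch of such a test, assuming SciPy is installed. Note that, strictly speaking, our kernel computes a correlation (the filter is not mirrored), so we flip the filter in both dimensions to match scipy.signal.convolve2d:

[ ]:
from scipy.signal import convolve2d

# reshape the flat Numpy arrays into their 2D form
input_2d = input_image.reshape(output_size[1] + border_size[1],
                               output_size[0] + border_size[0])
filter_2d = conv_filter.reshape(filter_size)

# 'valid' mode computes only the pixels for which the filter fully
# overlaps the input, which matches what our kernel does
reference = convolve2d(input_2d, filter_2d[::-1, ::-1], mode="valid")

# the tolerance is a loose assumption to absorb float32 rounding differences
assert np.allclose(answer[0].reshape(output_size), reference, atol=1e-3)
print("Test passed")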

Tuning 2D Convolution

In many cases there are more tunable parameters than just the thread block dimensions. We have included a highly-optimized 2D Convolution kernel that contains many parametrized code optimizations. It's a bit long to include here, so instead we just point to the file; you may need to adjust the path depending on where you've stored Kernel Tuner's source code and where this notebook is executing.

[ ]:
filename = "../examples/cuda/convolution.cu"

Tuning a kernel with Kernel Tuner is done using the tune_kernel function. The interface should look familiar, because it’s exactly like run_kernel:

tune_kernel(kernel_name, kernel_source, problem_size, arguments, tune_params, ...)

The only difference is that the params dictionary is replaced by a tune_params dictionary that works similarly, but instead of a single value per parameter tune_params should contain a list of possible values for that parameter.

Again, the ellipsis indicates that there are many more optional arguments, but we won't need those right now. If you're interested, the complete API documentation of tune_kernel can be found here.

We could create a tune_params dictionary in the following way:

[ ]:
tune_params = dict()
tune_params["block_size_x"] = [16, 32, 64, 128]
tune_params["block_size_y"] = [8, 16]

Let’s just try that out and see what happens:

[ ]:
from kernel_tuner import tune_kernel
results, env = tune_kernel(kernel_name, filename, problem_size, arguments, tune_params)

As you can see, Kernel Tuner takes the Cartesian product of all lists in tune_params and benchmarks a kernel for each possible combination of values of the tunable parameters. For such a small set of combinations, benchmarking all of them is not really a problem. However, if there are many tunable parameters with many different options, this can become problematic. Therefore, Kernel Tuner supports many different optimization strategies; how to use these is explained in the API documentation of tune_kernel.

Some combinations of values are illegal and will be skipped automatically. For example, thread block dimensions of \(128 \times 16 = 2048\) threads exceed the limit of 1024 threads per block that currently holds for all CUDA devices. Configurations that fail for other, expected, reasons, such as using too much shared memory or requiring more registers than available on the device, are also silently skipped by Kernel Tuner, unless you pass verbose=True as an optional argument to tune_kernel. Note that other errors, such as an out-of-bounds memory access, will not be ignored.

The tune_kernel function returns two things. The first is the results, which is a list of records that show the execution time of each benchmarked kernel and the parameters used to compile and run that specific kernel configuration. Secondly, tune_kernel returns a dictionary that describes the environment in which the tuning experiment took place. That means all the inputs to tune_kernel are recorded, but also the software versions of your CUDA installation, OS and so on, along with GPU device information. This second dictionary can be stored along with the results so that you can always find out under what circumstances those results were obtained.
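For example, a quick way to inspect the outcome is shown below; this is a small sketch that assumes the list-of-dicts format described above, where each record contains the tunable parameters and a "time" entry:

[ ]:
# find and print the best-performing configuration that was benchmarked
best = min(results, key=lambda record: record["time"])
print("Best configuration:", best)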

More tunable parameters

I promised that we would use more tunable parameters than just the thread block dimensions. Our 2D Convolution kernel also supports tiling factors in the x and y dimensions. A tiling factor indicates that the amount of work performed by each thread block in a particular dimension is increased by that factor.

[ ]:
tune_params["tile_size_x"] = [1, 2, 4]
tune_params["tile_size_y"] = [1, 2, 4]

It's important to understand that if we increase the amount of work performed by every thread block, we also need fewer thread blocks, because the total amount of work stays the same. Remember that Kernel Tuner computes the grid dimensions (the number of thread blocks the kernel is executed with) from the problem_size and the thread block dimensions.

So now we need to tell Kernel Tuner that we have a tunable parameter that influences the way the grid dimensions are computed; for this we have the grid divisor lists. You may have noticed that we already have tunable parameters that influence the grid dimensions, namely the thread block dimensions "block_size_x" and "block_size_y". We did not need to specify any grid divisor lists so far, because Kernel Tuner divides the problem size by the thread block dimensions by default. However, once we use grid divisor lists, we need to list all tunable parameters that divide the problem size in a certain dimension to obtain the grid size in that dimension.

So to mimic the default behavior that we have been assuming so far, we would need to specify:

[ ]:
grid_div_x = ["block_size_x"]
grid_div_y = ["block_size_y"]

Now we should add the tiling factors to the grid divisor lists because, as the tiling factor is increased, the number of thread blocks in that dimension should be decreased correspondingly.

[ ]:
grid_div_x = ["block_size_x", "tile_size_x"]
grid_div_y = ["block_size_y", "tile_size_y"]
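To see the effect (again just an illustration, not part of the tuning script): for a hypothetical configuration with block_size_x=32 and tile_size_x=4, the grid dimension in x becomes:

import math

# the grid dimension is the problem size divided (rounding up) by the
# product of all tunable parameters in the grid divisor list
grid_x = math.ceil(4096 / (32 * 4))  # 32 thread blocks in x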

Before we continue with calling tune_kernel, we'll show how to make Kernel Tuner display the performance of our kernel using the commonly used performance metric GFLOP/s (giga floating-point operations per second). We can specify how Kernel Tuner should compute user-defined metrics through the metrics option. Metrics should be specified using an ordered dictionary, because metrics are composable. We can define metrics as lambda functions that take one argument: a dictionary with the tunable parameters and benchmark results of the kernel configuration.

[ ]:
from collections import OrderedDict
metrics = OrderedDict()
metrics["GFLOP/s"] = lambda p : np.prod((2,)+output_size+filter_size)/1e9 / (p["time"]/1e3)
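Here, the factor 2 counts one multiplication and one addition per filter coefficient, so the total number of floating-point operations is \(2 \times 4096 \times 4096 \times 17 \times 17\), and p["time"] is the benchmarked kernel execution time in milliseconds, hence the division by 1e3 to convert it to seconds.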

Now we are ready to call tune_kernel again with our expanded search space. Note that this may take a bit longer, since we have just increased our parameter space by a factor of 9.

[ ]:
results, env = tune_kernel(kernel_name, filename, problem_size, arguments, tune_params,
                           grid_div_x=grid_div_x, grid_div_y=grid_div_y, metrics=metrics)

And that’s it for this guide! We have seen how to call run_kernel and tune_kernel for a 2D Convolution kernel using different thread block dimensions and other tunable parameters. You now know enough to be able to start tuning your own CUDA and/or OpenCL kernels!