Kernel Tuner Examples

Most of the examples show how to use Kernel Tuner to tune a CUDA, OpenCL, or C kernel, while demonstrating a particular use case of Kernel Tuner.

The exceptions are test_vector_add.py and test_vector_add_parameterized.py, which show how to write tests for GPU kernels with Kernel Tuner.

Note

Please do not use these examples as performance benchmarks; they are created specifically to highlight certain features of Kernel Tuner. Contact the developers if you are interested in benchmarking Kernel Tuner.

Below we list the example applications and the features they illustrate.

Vector Add

[CUDA] [CUDA-C++] [OpenCL] [C] [Fortran]
  • use Kernel Tuner to tune a simple kernel
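
For reference, a minimal sketch of such a call with the tune_kernel API; the kernel string, sizes, and parameter values here are illustrative and not copied verbatim from the example:

    import numpy as np
    from kernel_tuner import tune_kernel

    # A straightforward CUDA vector add; block_size_x is a tunable parameter
    # that Kernel Tuner inserts into the code as a preprocessor define
    kernel_string = """
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }
    """

    size = 10_000_000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    n = np.int32(size)

    # The search space: one tunable parameter, the number of threads per block
    tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}

    results, env = tune_kernel("vector_add", kernel_string, size,
                               [c, a, b, n], tune_params)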

Stencil

[CUDA] [OpenCL]
  • use a 2-dimensional problem domain with 2-dimensional thread blocks in a simple and clean example
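
A minimal sketch with an illustrative 4-point stencil rather than the exact kernel from the example; a 2-element problem_size gives a 2D grid, which by default is divided by block_size_x and block_size_y:

    import numpy as np
    from kernel_tuner import tune_kernel

    # Illustrative 4-point stencil; not the exact kernel from the example
    kernel_string = """
    __global__ void stencil(float *out, const float *in, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x > 0 && x < w - 1 && y > 0 && y < h - 1) {
            out[y * w + x] = 0.25f * (in[y * w + x - 1] + in[y * w + x + 1] +
                                      in[(y - 1) * w + x] + in[(y + 1) * w + x]);
        }
    }
    """

    width, height = 4096, 4096
    inp = np.random.randn(height, width).astype(np.float32)
    out = np.zeros_like(inp)
    args = [out, inp, np.int32(width), np.int32(height)]

    # 2D problem size: the x- and y-dimensions of the grid are computed by
    # dividing by block_size_x and block_size_y, respectively
    problem_size = (width, height)
    tune_params = {
        "block_size_x": [16, 32, 64, 128],
        "block_size_y": [1, 2, 4, 8, 16],
    }

    tune_kernel("stencil", kernel_string, problem_size, args, tune_params)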

Matrix Multiplication

[CUDA] [OpenCL]
  • pass a filename instead of a string with code

  • use 2-dimensional thread blocks and tiling in both dimensions

  • tell Kernel Tuner to compute the grid dimensions for 2D thread blocks with tiling

  • use the restrictions option to limit the search to only valid configurations

  • use a user-defined performance metric such as GFLOP/s (see the sketch below)
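
A minimal sketch of how these options fit together; the file name matmul.cu, the kernel name matmul_kernel, the particular restriction, and the tile sizes are illustrative assumptions, not necessarily those of the example:

    from collections import OrderedDict

    import numpy as np
    from kernel_tuner import tune_kernel

    n = 2048
    A = np.random.randn(n, n).astype(np.float32)
    B = np.random.randn(n, n).astype(np.float32)
    C = np.zeros_like(A)

    tune_params = OrderedDict()
    tune_params["block_size_x"] = [16, 32, 64]
    tune_params["block_size_y"] = [1, 2, 4, 8, 16, 32]
    tune_params["tile_size_x"] = [1, 2, 4]
    tune_params["tile_size_y"] = [1, 2, 4]

    # Each thread block computes a tile of block_size_* x tile_size_* elements,
    # so the grid in each dimension is divided by both factors
    grid_div_x = ["block_size_x", "tile_size_x"]
    grid_div_y = ["block_size_y", "tile_size_y"]

    # Limit the search to configurations that are valid for the kernel
    # (this particular restriction is illustrative)
    restrict = ["block_size_x==block_size_y*tile_size_y"]

    # A user-defined metric, computed from the measured time (in ms)
    metrics = OrderedDict()
    metrics["GFLOP/s"] = lambda p: (2 * n**3 / 1e9) / (p["time"] / 1e3)

    # Passing a filename makes Kernel Tuner read the kernel code from that file
    results, env = tune_kernel("matmul_kernel", "matmul.cu", (n, n),
                               [C, A, B], tune_params,
                               grid_div_x=grid_div_x, grid_div_y=grid_div_y,
                               restrictions=restrict, metrics=metrics)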

Convolution

There are several different examples centered around the convolution kernel [CUDA] [OpenCL].

convolution.py

[CUDA] [OpenCL]
  • use tunable parameters to tune for multiple input sizes

  • pass constant memory arguments to the kernel

  • write output to a JSON file (see the sketch below)
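
A minimal sketch, under the assumption that the kernel lives in a file convolution.cu with entry point convolution_kernel and a __constant__ array named d_filter; sizes and parameter values are illustrative:

    import json

    import numpy as np
    from kernel_tuner import tune_kernel

    filter_size = 17
    image_width, image_height = 4096, 4096
    output = np.zeros(image_width * image_height, dtype=np.float32)
    image = np.random.randn((image_width + filter_size) *
                            (image_height + filter_size)).astype(np.float32)
    filt = np.random.randn(filter_size * filter_size).astype(np.float32)

    tune_params = {
        "block_size_x": [16, 32, 64, 128],
        "block_size_y": [1, 2, 4, 8, 16],
        # the filter dimensions are tunable parameters too, which makes it
        # easy to re-tune the same kernel for different input sizes
        "filter_width": [filter_size],
        "filter_height": [filter_size],
    }

    # Pass the filter through constant memory; the key is the name of the
    # __constant__ array in the kernel code
    cmem_args = {"d_filter": filt}

    results, env = tune_kernel("convolution_kernel", "convolution.cu",
                               (image_width, image_height),
                               [output, image, filt], tune_params,
                               cmem_args=cmem_args)

    # Store the benchmark results as JSON
    with open("convolution_results.json", "w") as fp:
        json.dump(results, fp)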

sepconv.py

[CUDA] [OpenCL]
  • use the convolution kernel for separable filters

  • write output to a CSV file using Pandas (see the sketch below)
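
The CSV part can be sketched with any kernel, since tune_kernel always returns a list of dictionaries; the trivial kernel below is only a stand-in for the separable convolution:

    import numpy as np
    import pandas
    from kernel_tuner import tune_kernel

    kernel_string = """
    __global__ void scale(float *x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            x[i] *= s;
        }
    }
    """

    size = 1_000_000
    x = np.random.randn(size).astype(np.float32)
    args = [x, np.float32(2.0), np.int32(size)]
    tune_params = {"block_size_x": [32, 64, 128, 256, 512]}

    results, env = tune_kernel("scale", kernel_string, size, args, tune_params)

    # tune_kernel returns one dict per benchmarked configuration,
    # which maps directly onto a DataFrame
    pandas.DataFrame(results).to_csv("results.csv", index=False)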

convolution_correct.py

[CUDA] [OpenCL]
  • use run_kernel to compute a reference answer

  • verify the output of every benchmarked kernel (see the sketch below)
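
The example applies this to the convolution kernel; the sketch below shows the same run_kernel/answer pattern with a trivial kernel to keep it short:

    import numpy as np
    from kernel_tuner import run_kernel, tune_kernel

    kernel_string = """
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }
    """

    size = 1_000_000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    args = [c, a, b, np.int32(size)]

    # Run one known-good configuration once to obtain a reference result
    reference = run_kernel("vector_add", kernel_string, size, args,
                           {"block_size_x": 256})

    # answer has one entry per kernel argument; None means "do not check"
    answer = [reference[0], None, None, None]

    # Every benchmarked configuration is now verified against the answer
    tune_kernel("vector_add", kernel_string, size, args,
                {"block_size_x": [32, 64, 128, 256, 512]}, answer=answer)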

convolution_streams.py

[CUDA]
  • allocate page-locked host memory from Python

  • overlap transfers to and from the GPU with computation

  • tune parameters in the host code in combination with those in the kernel

  • use the lang="C" option and set compiler options

  • pass a list of filenames instead of strings with kernel code (see the sketch below)
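
Only the tune_kernel call is sketched here; the page-locked allocation, stream management, and timing are handled by the example's host code and setup, which are not shown. The file names, the num_streams values, and the compiler option are illustrative assumptions:

    import numpy as np
    from kernel_tuner import tune_kernel

    filter_size = 17
    image_width, image_height = 4096, 4096
    output = np.zeros(image_width * image_height, dtype=np.float32)
    image = np.random.randn((image_width + filter_size) *
                            (image_height + filter_size)).astype(np.float32)
    filt = np.random.randn(filter_size * filter_size).astype(np.float32)
    args = [output, image, filt]

    # The entry point is a C host function that sets up the streams, copies
    # data, and launches the kernel; both files are compiled together
    kernel_files = ["convolution_streams.cu", "convolution_streams.c"]

    tune_params = {
        "block_size_x": [16, 32, 64],
        "block_size_y": [2, 4, 8],
        # a host-code parameter: how many CUDA streams to overlap
        "num_streams": [1, 2, 4, 8],
    }

    tune_kernel("convolution_streams", kernel_files,
                (image_width, image_height), args, tune_params,
                lang="C", compiler_options=["-arch=sm_70"])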

Reduction

[CUDA] [OpenCL]
  • use vector types and shuffle instructions (shuffle is only available in CUDA)

  • tune the number of thread blocks the kernel is executed with

  • tune the partial loop unrolling factor of a for-loop

  • tune a pipeline that consists of two kernels

  • tune with a custom output verification function (see the sketch below)
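
A sketch of tuning the grid size and a loop unrolling factor together with a custom verify function; the kernel, the parameter values, and the summed comparison are illustrative, not the example's exact code:

    import numpy as np
    from kernel_tuner import tune_kernel

    # Illustrative partial-sum kernel: each thread block writes one partial sum
    kernel_string = """
    __global__ void sum_partials(float *partial, const float *x, int n) {
        __shared__ float sh[block_size_x];
        int tid = threadIdx.x;
        float acc = 0.0f;
        // grid-stride loop; the unroll factor is a tunable parameter
        // (Kernel Tuner treats parameters named loop_unroll_factor* specially)
        #pragma unroll loop_unroll_factor
        for (int i = blockIdx.x * block_size_x + tid; i < n;
             i += gridDim.x * block_size_x) {
            acc += x[i];
        }
        sh[tid] = acc;
        __syncthreads();
        for (int s = block_size_x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                sh[tid] += sh[tid + s];
            }
            __syncthreads();
        }
        if (tid == 0) {
            partial[blockIdx.x] = sh[0];
        }
    }
    """

    size = 8_000_000
    x = np.random.rand(size).astype(np.float32)
    partial = np.zeros(1024, dtype=np.float32)
    args = [partial, x, np.int32(size)]

    tune_params = {
        "block_size_x": [128, 256, 512],
        "num_blocks": [64, 128, 256, 512, 1024],   # grid size is tuned directly
        "loop_unroll_factor": [1, 2, 4, 8],
    }

    # Only the total of all partial sums is meaningful, so a custom verify
    # function compares sums instead of comparing element by element
    expected = np.zeros_like(partial)
    expected[0] = np.sum(x)

    def verify_sum(answer, result, atol=None):
        return np.isclose(np.sum(answer), np.sum(result), rtol=1e-3)

    # problem_size "num_blocks" with an empty grid_div_x means the number of
    # thread blocks equals the value of the num_blocks parameter
    tune_kernel("sum_partials", kernel_string, "num_blocks", args, tune_params,
                grid_div_x=[], answer=[expected, None, None], verify=verify_sum)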

Sparse Matrix Vector Multiplication

[CUDA]
  • use scipy to compute a reference answer and verify all benchmarked kernels

  • express that the number of thread blocks depends on the values of tunable parameters (see the sketch below)
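
A minimal sketch with an illustrative one-thread-per-row CSR kernel; scipy provides the reference result, and with the default grid divisors the number of thread blocks follows the block_size_x parameter:

    import numpy as np
    import scipy.sparse
    from kernel_tuner import tune_kernel

    # Illustrative CSR SpMV kernel: one thread per row
    kernel_string = """
    __global__ void spmv(float *y, const int *rowptr, const int *cols,
                         const float *vals, const float *x, int nrows) {
        int row = blockIdx.x * block_size_x + threadIdx.x;
        if (row < nrows) {
            float sum = 0.0f;
            for (int j = rowptr[row]; j < rowptr[row + 1]; j++) {
                sum += vals[j] * x[cols[j]];
            }
            y[row] = sum;
        }
    }
    """

    nrows = 100_000
    A = scipy.sparse.random(nrows, nrows, density=1e-4, format="csr",
                            dtype=np.float32)
    x = np.random.rand(nrows).astype(np.float32)
    y = np.zeros(nrows, dtype=np.float32)

    args = [y, A.indptr.astype(np.int32), A.indices.astype(np.int32),
            A.data.astype(np.float32), x, np.int32(nrows)]

    # scipy computes the reference result used to verify every configuration
    answer = [A.dot(x), None, None, None, None, None]

    tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}

    # With the default grid_div_x (["block_size_x"]), the number of thread
    # blocks is ceil(nrows / block_size_x), so it varies with the parameter
    tune_kernel("spmv", kernel_string, nrows, args, tune_params,
                answer=answer, atol=1e-4)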

Point-in-Polygon

[CUDA]
  • overlap transfers with device-mapped host memory

  • tune different implementations of an algorithm (see the sketch below)
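
The selection mechanism can be sketched with a toy kernel: a tunable parameter is used in a preprocessor condition, so each value compiles a different implementation (the real example does this for its point-in-polygon tests):

    import numpy as np
    from kernel_tuner import tune_kernel

    kernel_string = """
    __global__ void cube(float *out, const float *in, int n) {
        int i = blockIdx.x * block_size_x + threadIdx.x;
        if (i < n) {
        #if algorithm == 0
            out[i] = in[i] * in[i] * in[i];   // implementation 0: multiplications
        #else
            out[i] = powf(in[i], 3.0f);       // implementation 1: library call
        #endif
        }
    }
    """

    size = 1_000_000
    x = np.random.randn(size).astype(np.float32)
    out = np.zeros_like(x)

    tune_params = {
        "block_size_x": [64, 128, 256, 512],
        "algorithm": [0, 1],   # which implementation to benchmark
    }

    tune_kernel("cube", kernel_string, size, [out, x, np.int32(size)],
                tune_params)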

ExpDist

[CUDA]
  • 2D reduction within a thread block using the CUB library

  • use C++ in CUDA kernel code

  • tune multiple kernels in a pipeline (see the sketch below)
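
The C++ and CUB parts are specific to the ExpDist kernels themselves and are not shown here; the pipeline aspect can be sketched with two toy kernels that are tuned one after the other, with run_kernel feeding the output of the first stage into the tuning of the second:

    import numpy as np
    from kernel_tuner import run_kernel, tune_kernel

    kernels = """
    __global__ void square(float *y, const float *x, int n) {
        int i = blockIdx.x * block_size_x + threadIdx.x;
        if (i < n) {
            y[i] = x[i] * x[i];
        }
    }
    __global__ void offset(float *z, const float *y, int n) {
        int i = blockIdx.x * block_size_x + threadIdx.x;
        if (i < n) {
            z[i] = y[i] + 1.0f;
        }
    }
    """

    size = 1_000_000
    x = np.random.randn(size).astype(np.float32)
    y = np.zeros_like(x)
    z = np.zeros_like(x)
    n = np.int32(size)
    tune_params = {"block_size_x": [64, 128, 256, 512]}

    # Tune the first stage of the pipeline
    tune_kernel("square", kernels, size, [y, x, n], tune_params)

    # Run the first stage once so the second stage is tuned on realistic input
    y = run_kernel("square", kernels, size, [y, x, n], {"block_size_x": 256})[0]

    # Tune the second stage
    tune_kernel("offset", kernels, size, [z, y, n], tune_params)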

Code Generator

[CUDA] [OpenCL]
  • use a Python function as a code generator
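
A minimal sketch, assuming the generator function receives the dictionary of tunable parameter values and that lang is specified explicitly because it cannot be derived from a function:

    import numpy as np
    from kernel_tuner import tune_kernel

    # Instead of a string or filename, a Python function generates the
    # kernel code from the tunable parameters it receives
    def generate_code(params):
        return f"""
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {{
        int i = blockIdx.x * {params['block_size_x']} + threadIdx.x;
        if (i < n) {{
            c[i] = a[i] + b[i];
        }}
    }}
    """

    size = 1_000_000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)

    tune_params = {"block_size_x": [64, 128, 256, 512]}

    tune_kernel("vector_add", generate_code, size, [c, a, b, np.int32(size)],
                tune_params, lang="CUDA")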