Kernel Tuner Examples

Most of the examples show how to use Kernel Tuner to tune a CUDA, OpenCL, or C kernel, while demonstrating a particular use case of Kernel Tuner.

The exceptions are test_vector_add.py and test_vector_add_parameterized.py, which show how to write tests for GPU kernels with Kernel Tuner.

Note

Please do not use these examples as performance benchmarks; they are created specifically to highlight certain features of Kernel Tuner. Contact the developers if you are interested in benchmarking Kernel Tuner.

Below we list the example applications and the features they illustrate.

Vector Add

[CUDA] [CUDA-C++] [OpenCL] [C] [Fortran]
  • use Kernel Tuner to tune a simple kernel
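
For reference, a minimal sketch of such a call with the tune_kernel API; the kernel string, sizes, and parameter values here are illustrative and not copied verbatim from the example:

    import numpy as np
    from kernel_tuner import tune_kernel

    # A straightforward CUDA vector add; block_size_x is a tunable parameter
    # that Kernel Tuner inserts into the code as a preprocessor define
    kernel_string = """
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }
    """

    size = 10_000_000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    n = np.int32(size)

    # The search space: one tunable parameter, the number of threads per block
    tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}

    results, env = tune_kernel("vector_add", kernel_string, size,
                               [c, a, b, n], tune_params)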

Stencil

[CUDA] [OpenCL]
  • use a 2-dimensional problem domain with 2-dimensional thread blocks in a simple and clean example
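
A minimal sketch with an illustrative 4-point stencil rather than the exact kernel from the example; a 2-element problem_size gives a 2D grid, which by default is divided by block_size_x and block_size_y:

    import numpy as np
    from kernel_tuner import tune_kernel

    # Illustrative 4-point stencil; not the exact kernel from the example
    kernel_string = """
    __global__ void stencil(float *out, const float *in, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x > 0 && x < w - 1 && y > 0 && y < h - 1) {
            out[y * w + x] = 0.25f * (in[y * w + x - 1] + in[y * w + x + 1] +
                                      in[(y - 1) * w + x] + in[(y + 1) * w + x]);
        }
    }
    """

    width, height = 4096, 4096
    inp = np.random.randn(height, width).astype(np.float32)
    out = np.zeros_like(inp)
    args = [out, inp, np.int32(width), np.int32(height)]

    # 2D problem size: the x- and y-dimensions of the grid are computed by
    # dividing by block_size_x and block_size_y, respectively
    problem_size = (width, height)
    tune_params = {
        "block_size_x": [16, 32, 64, 128],
        "block_size_y": [1, 2, 4, 8, 16],
    }

    tune_kernel("stencil", kernel_string, problem_size, args, tune_params)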

Matrix Multiplication

[CUDA] [OpenCL]
  • pass a filename instead of a string with code

  • use 2-dimensional thread blocks and tiling in both dimensions

  • tell Kernel Tuner to compute the grid dimensions for 2D thread blocks with tiling

  • use the restrictions option to limit the search to only valid configurations

  • use a user-defined performance metric such as GFLOP/s (see the sketch below)
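
A minimal sketch of how these options fit together; the file name matmul.cu, the kernel name matmul_kernel, the particular restriction, and the tile sizes are illustrative assumptions, not necessarily those of the example:

    from collections import OrderedDict

    import numpy as np
    from kernel_tuner import tune_kernel

    n = 2048
    A = np.random.randn(n, n).astype(np.float32)
    B = np.random.randn(n, n).astype(np.float32)
    C = np.zeros_like(A)

    tune_params = OrderedDict()
    tune_params["block_size_x"] = [16, 32, 64]
    tune_params["block_size_y"] = [1, 2, 4, 8, 16, 32]
    tune_params["tile_size_x"] = [1, 2, 4]
    tune_params["tile_size_y"] = [1, 2, 4]

    # Each thread block computes a tile of block_size_* x tile_size_* elements,
    # so the grid in each dimension is divided by both factors
    grid_div_x = ["block_size_x", "tile_size_x"]
    grid_div_y = ["block_size_y", "tile_size_y"]

    # Limit the search to configurations that are valid for the kernel
    # (this particular restriction is illustrative)
    restrict = ["block_size_x==block_size_y*tile_size_y"]

    # A user-defined metric, computed from the measured time (in ms)
    metrics = OrderedDict()
    metrics["GFLOP/s"] = lambda p: (2 * n**3 / 1e9) / (p["time"] / 1e3)

    # Passing a filename makes Kernel Tuner read the kernel code from that file
    results, env = tune_kernel("matmul_kernel", "matmul.cu", (n, n),
                               [C, A, B], tune_params,
                               grid_div_x=grid_div_x, grid_div_y=grid_div_y,
                               restrictions=restrict, metrics=metrics)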

Convolution

There are several different examples centered around the convolution kernel [CUDA] [OpenCL].

convolution.py

[CUDA] [OpenCL]
  • use tunable parameters to tune for multiple input sizes

  • pass constant memory arguments to the kernel

  • write output to a JSON file (see the sketch below)
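
A minimal sketch, under the assumption that the kernel lives in a file convolution.cu with entry point convolution_kernel and a __constant__ array named d_filter; sizes and parameter values are illustrative:

    import json

    import numpy as np
    from kernel_tuner import tune_kernel

    filter_size = 17
    image_width, image_height = 4096, 4096
    output = np.zeros(image_width * image_height, dtype=np.float32)
    image = np.random.randn((image_width + filter_size) *
                            (image_height + filter_size)).astype(np.float32)
    filt = np.random.randn(filter_size * filter_size).astype(np.float32)

    tune_params = {
        "block_size_x": [16, 32, 64, 128],
        "block_size_y": [1, 2, 4, 8, 16],
        # the filter dimensions are tunable parameters too, which makes it
        # easy to re-tune the same kernel for different input sizes
        "filter_width": [filter_size],
        "filter_height": [filter_size],
    }

    # Pass the filter through constant memory; the key is the name of the
    # __constant__ array in the kernel code
    cmem_args = {"d_filter": filt}

    results, env = tune_kernel("convolution_kernel", "convolution.cu",
                               (image_width, image_height),
                               [output, image, filt], tune_params,
                               cmem_args=cmem_args)

    # Store the benchmark results as JSON
    with open("convolution_results.json", "w") as fp:
        json.dump(results, fp)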

sepconv.py

[CUDA] [OpenCL]
  • use the convolution kernel for separable filters

  • write output to a CSV file using Pandas (see the sketch below)
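
The CSV part can be sketched with any kernel, since tune_kernel always returns a list of dictionaries; the trivial kernel below is only a stand-in for the separable convolution:

    import numpy as np
    import pandas
    from kernel_tuner import tune_kernel

    kernel_string = """
    __global__ void scale(float *x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            x[i] *= s;
        }
    }
    """

    size = 1_000_000
    x = np.random.randn(size).astype(np.float32)
    args = [x, np.float32(2.0), np.int32(size)]
    tune_params = {"block_size_x": [32, 64, 128, 256, 512]}

    results, env = tune_kernel("scale", kernel_string, size, args, tune_params)

    # tune_kernel returns one dict per benchmarked configuration,
    # which maps directly onto a DataFrame
    pandas.DataFrame(results).to_csv("results.csv", index=False)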

convolution_correct.py

[CUDA] [OpenCL]
  • use run_kernel to compute a reference answer

  • verify the output of every benchmarked kernel (see the sketch below)
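
The example applies this to the convolution kernel; the sketch below shows the same run_kernel/answer pattern with a trivial kernel to keep it short:

    import numpy as np
    from kernel_tuner import run_kernel, tune_kernel

    kernel_string = """
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }
    """

    size = 1_000_000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    args = [c, a, b, np.int32(size)]

    # Run one known-good configuration once to obtain a reference result
    reference = run_kernel("vector_add", kernel_string, size, args,
                           {"block_size_x": 256})

    # answer has one entry per kernel argument; None means "do not check"
    answer = [reference[0], None, None, None]

    # Every benchmarked configuration is now verified against the answer
    tune_kernel("vector_add", kernel_string, size, args,
                {"block_size_x": [32, 64, 128, 256, 512]}, answer=answer)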

convolution_streams.py

[CUDA]
  • allocate page-locked host memory from Python

  • overlap transfers to and from the GPU with computation

  • tune parameters in the host code in combination with those in the kernel

  • use the lang="C" option and set compiler options

  • pass a list of filenames instead of strings with kernel code (see the sketch below)
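
Only the tune_kernel call is sketched here; the page-locked allocation, stream management, and timing are handled by the example's host code and setup, which are not shown. The file names, the num_streams values, and the compiler option are illustrative assumptions:

    import numpy as np
    from kernel_tuner import tune_kernel

    filter_size = 17
    image_width, image_height = 4096, 4096
    output = np.zeros(image_width * image_height, dtype=np.float32)
    image = np.random.randn((image_width + filter_size) *
                            (image_height + filter_size)).astype(np.float32)
    filt = np.random.randn(filter_size * filter_size).astype(np.float32)
    args = [output, image, filt]

    # The entry point is a C host function that sets up the streams, copies
    # data, and launches the kernel; both files are compiled together
    kernel_files = ["convolution_streams.cu", "convolution_streams.c"]

    tune_params = {
        "block_size_x": [16, 32, 64],
        "block_size_y": [2, 4, 8],
        # a host-code parameter: how many CUDA streams to overlap
        "num_streams": [1, 2, 4, 8],
    }

    tune_kernel("convolution_streams", kernel_files,
                (image_width, image_height), args, tune_params,
                lang="C", compiler_options=["-arch=sm_70"])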

Reduction

[CUDA] [OpenCL]
  • use vector types and shuffle instructions (shuffle is only available in CUDA)

  • tune the number of thread blocks the kernel is executed with

  • tune the partial loop unrolling factor of a for-loop

  • tune a pipeline that consists of two kernels

  • tune with a custom output verification function (see the sketch below)
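
A sketch of tuning the grid size and a loop unrolling factor together with a custom verify function; the kernel, the parameter values, and the summed comparison are illustrative, not the example's exact code:

    import numpy as np
    from kernel_tuner import tune_kernel

    # Illustrative partial-sum kernel: each thread block writes one partial sum
    kernel_string = """
    __global__ void sum_partials(float *partial, const float *x, int n) {
        __shared__ float sh[block_size_x];
        int tid = threadIdx.x;
        float acc = 0.0f;
        // grid-stride loop; the unroll factor is a tunable parameter
        // (Kernel Tuner treats parameters named loop_unroll_factor* specially)
        #pragma unroll loop_unroll_factor
        for (int i = blockIdx.x * block_size_x + tid; i < n;
             i += gridDim.x * block_size_x) {
            acc += x[i];
        }
        sh[tid] = acc;
        __syncthreads();
        for (int s = block_size_x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                sh[tid] += sh[tid + s];
            }
            __syncthreads();
        }
        if (tid == 0) {
            partial[blockIdx.x] = sh[0];
        }
    }
    """

    size = 8_000_000
    x = np.random.rand(size).astype(np.float32)
    partial = np.zeros(1024, dtype=np.float32)
    args = [partial, x, np.int32(size)]

    tune_params = {
        "block_size_x": [128, 256, 512],
        "num_blocks": [64, 128, 256, 512, 1024],   # grid size is tuned directly
        "loop_unroll_factor": [1, 2, 4, 8],
    }

    # Only the total of all partial sums is meaningful, so a custom verify
    # function compares sums instead of comparing element by element
    expected = np.zeros_like(partial)
    expected[0] = np.sum(x)

    def verify_sum(answer, result, atol=None):
        return np.isclose(np.sum(answer), np.sum(result), rtol=1e-3)

    # problem_size "num_blocks" with an empty grid_div_x means the number of
    # thread blocks equals the value of the num_blocks parameter
    tune_kernel("sum_partials", kernel_string, "num_blocks", args, tune_params,
                grid_div_x=[], answer=[expected, None, None], verify=verify_sum)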

Sparse Matrix Vector Multiplication

[CUDA]
  • use scipy to compute a reference answer and verify all benchmarked kernels

  • express that the number of thread blocks depends on the values of tunable parameters (see the sketch below)
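
A minimal sketch with an illustrative one-thread-per-row CSR kernel; scipy provides the reference result, and with the default grid divisors the number of thread blocks follows the block_size_x parameter:

    import numpy as np
    import scipy.sparse
    from kernel_tuner import tune_kernel

    # Illustrative CSR SpMV kernel: one thread per row
    kernel_string = """
    __global__ void spmv(float *y, const int *rowptr, const int *cols,
                         const float *vals, const float *x, int nrows) {
        int row = blockIdx.x * block_size_x + threadIdx.x;
        if (row < nrows) {
            float sum = 0.0f;
            for (int j = rowptr[row]; j < rowptr[row + 1]; j++) {
                sum += vals[j] * x[cols[j]];
            }
            y[row] = sum;
        }
    }
    """

    nrows = 100_000
    A = scipy.sparse.random(nrows, nrows, density=1e-4, format="csr",
                            dtype=np.float32)
    x = np.random.rand(nrows).astype(np.float32)
    y = np.zeros(nrows, dtype=np.float32)

    args = [y, A.indptr.astype(np.int32), A.indices.astype(np.int32),
            A.data.astype(np.float32), x, np.int32(nrows)]

    # scipy computes the reference result used to verify every configuration
    answer = [A.dot(x), None, None, None, None, None]

    tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}

    # With the default grid_div_x (["block_size_x"]), the number of thread
    # blocks is ceil(nrows / block_size_x), so it varies with the parameter
    tune_kernel("spmv", kernel_string, nrows, args, tune_params,
                answer=answer, atol=1e-4)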

Point-in-Polygon

[CUDA]
  • overlap transfers with device-mapped host memory

  • tune different implementations of an algorithm (see the sketch below)
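
The selection mechanism can be sketched with a toy kernel: a tunable parameter is used in a preprocessor condition, so each value compiles a different implementation (the real example does this for its point-in-polygon tests):

    import numpy as np
    from kernel_tuner import tune_kernel

    kernel_string = """
    __global__ void cube(float *out, const float *in, int n) {
        int i = blockIdx.x * block_size_x + threadIdx.x;
        if (i < n) {
        #if algorithm == 0
            out[i] = in[i] * in[i] * in[i];   // implementation 0: multiplications
        #else
            out[i] = powf(in[i], 3.0f);       // implementation 1: library call
        #endif
        }
    }
    """

    size = 1_000_000
    x = np.random.randn(size).astype(np.float32)
    out = np.zeros_like(x)

    tune_params = {
        "block_size_x": [64, 128, 256, 512],
        "algorithm": [0, 1],   # which implementation to benchmark
    }

    tune_kernel("cube", kernel_string, size, [out, x, np.int32(size)],
                tune_params)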

ExpDist

[CUDA]
  • 2D reduction within a thread block using the CUB library

  • use C++ in CUDA kernel code

  • tune multiple kernels in a pipeline (see the sketch below)
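
The C++ and CUB parts are specific to the ExpDist kernels themselves and are not shown here; the pipeline aspect can be sketched with two toy kernels that are tuned one after the other, with run_kernel feeding the output of the first stage into the tuning of the second:

    import numpy as np
    from kernel_tuner import run_kernel, tune_kernel

    kernels = """
    __global__ void square(float *y, const float *x, int n) {
        int i = blockIdx.x * block_size_x + threadIdx.x;
        if (i < n) {
            y[i] = x[i] * x[i];
        }
    }
    __global__ void offset(float *z, const float *y, int n) {
        int i = blockIdx.x * block_size_x + threadIdx.x;
        if (i < n) {
            z[i] = y[i] + 1.0f;
        }
    }
    """

    size = 1_000_000
    x = np.random.randn(size).astype(np.float32)
    y = np.zeros_like(x)
    z = np.zeros_like(x)
    n = np.int32(size)
    tune_params = {"block_size_x": [64, 128, 256, 512]}

    # Tune the first stage of the pipeline
    tune_kernel("square", kernels, size, [y, x, n], tune_params)

    # Run the first stage once so the second stage is tuned on realistic input
    y = run_kernel("square", kernels, size, [y, x, n], {"block_size_x": 256})[0]

    # Tune the second stage
    tune_kernel("offset", kernels, size, [z, y, n], tune_params)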

Code Generator

[CUDA] [OpenCL]
  • use a Python function as a code generator
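
A minimal sketch, assuming the generator function receives the dictionary of tunable parameter values and that lang is specified explicitly because it cannot be derived from a function:

    import numpy as np
    from kernel_tuner import tune_kernel

    # Instead of a string or filename, a Python function generates the
    # kernel code from the tunable parameters it receives
    def generate_code(params):
        return f"""
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {{
        int i = blockIdx.x * {params['block_size_x']} + threadIdx.x;
        if (i < n) {{
            c[i] = a[i] + b[i];
        }}
    }}
    """

    size = 1_000_000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)

    tune_params = {"block_size_x": [64, 128, 256, 512]}

    tune_kernel("vector_add", generate_code, size, [c, a, b, np.int32(size)],
                tune_params, lang="CUDA")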