Kernel Tuner Examples¶
Most of the examples show how to use Kernel Tuner to tune a CUDA, OpenCL, or C kernel, while demonstrating a particular use case of Kernel Tuner.
The exceptions are test_vector_add.py and test_vector_add_parameterized.py, which show how to write tests for GPU kernels with Kernel Tuner.
Note
Please do not use the examples as performance benchmarks. The examples here are created specifically to highlight certain features in Kernel Tuner. Please contact the developers if you are interested in benchmarking Kernel Tuner.
Below we list the example applications and the features they illustrate.
Vector Add¶
Stencil¶
Matrix Multiplication¶
- [CUDA] [OpenCL]
pass a filename instead of a string with code
use 2-dimensional thread blocks and tiling in both dimensions
tell Kernel Tuner to compute the grid dimensions for 2D thread blocks with tiling
use the restrictions option to limit the search to only valid configurations
use a user-defined performance metric like GFLOP/s
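The ideas in this list can be illustrated without a GPU. The following is a minimal pure-Python sketch, not Kernel Tuner's implementation: it enumerates a small search space, filters it with a restriction, derives 2D grid dimensions by dividing the problem size by the work covered per thread block, and converts a measured time into GFLOP/s. The parameter values, the restriction, and all helper functions are assumptions chosen for illustration.

```python
from itertools import product
from math import ceil

# Hypothetical tunable parameters for a tiled matrix multiplication kernel.
tune_params = {
    "block_size_x": [16, 32, 64],
    "block_size_y": [1, 2, 4, 8, 16, 32],
    "tile_size_x": [1, 2, 4],
    "tile_size_y": [1, 2, 4],
}

def is_valid(cfg):
    # Example restriction (an assumption): the area covered by a thread
    # block must be as wide in x as it is tall in y after tiling.
    return cfg["block_size_x"] == cfg["block_size_y"] * cfg["tile_size_y"]

def search_space(params, restriction):
    # Yield every combination of parameter values that passes the restriction.
    names = list(params)
    for values in product(*params.values()):
        cfg = dict(zip(names, values))
        if restriction(cfg):
            yield cfg

def grid_dimensions(problem_size, cfg):
    # Each thread block covers block_size * tile_size elements per dimension,
    # so the grid is the problem size divided by that amount, rounded up.
    width, height = problem_size
    return (
        ceil(width / (cfg["block_size_x"] * cfg["tile_size_x"])),
        ceil(height / (cfg["block_size_y"] * cfg["tile_size_y"])),
    )

def gflops(n, time_ms):
    # A dense n x n matrix multiplication performs 2 * n^3 floating-point operations.
    return (2 * n**3 / 1e9) / (time_ms / 1e3)

configs = list(search_space(tune_params, is_valid))
```

In Kernel Tuner these roles are played by the `restrictions`, `grid_div_x`/`grid_div_y`, and `metrics` arguments of `tune_kernel`; the example scripts show the exact values used there.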
Convolution¶
There are several examples centered around the convolution kernel [CUDA] [OpenCL]:
convolution.py¶
sepconv.py¶
convolution_correct.py¶
convolution_streams.py¶
- [CUDA]
allocate page-locked host memory from Python
overlap transfers to and from the GPU with computation
tune parameters in the host code in combination with those in the kernel
use the lang="C" option and set compiler options
pass a list of filenames instead of strings with kernel code
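To see why the number of streams is worth tuning as a host-side parameter, consider the idealized timing model below. It is a sketch under strong assumptions, not a description of convolution_streams.py: the work is split evenly across chunks, host-to-device copies, kernel execution, and device-to-host copies each run on their own engine, and there is no per-chunk launch overhead. The function names and the timings are hypothetical.

```python
def pipelined_time(num_chunks, t_h2d, t_kernel, t_d2h):
    # Track when each of the three engines finishes its current chunk.
    # A stage can start only once the previous stage of the same chunk
    # is done and the engine itself is free.
    h2d_done = kernel_done = d2h_done = 0.0
    for _ in range(num_chunks):
        h2d_done += t_h2d
        kernel_done = max(kernel_done, h2d_done) + t_kernel
        d2h_done = max(d2h_done, kernel_done) + t_d2h
    return d2h_done

def estimate(num_streams, total_h2d, total_kernel, total_d2h):
    # Split the total work evenly over num_streams chunks (an assumption).
    return pipelined_time(
        num_streams,
        total_h2d / num_streams,
        total_kernel / num_streams,
        total_d2h / num_streams,
    )

# Hypothetical totals in milliseconds: 4 ms copy in, 8 ms compute, 4 ms copy out.
tune_params = {"num_streams": [1, 2, 4, 8]}
estimates = {n: estimate(n, 4.0, 8.0, 4.0) for n in tune_params["num_streams"]}
best = min(estimates, key=estimates.get)
```

In this model more streams always help. On real hardware, per-chunk overheads and the interaction with kernel parameters such as the thread block size mean the best value has to be found by measurement, which is why host and kernel parameters are tuned together.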
Reduction¶
Sparse Matrix Vector Multiplication¶
- [CUDA]
use scipy to compute a reference answer and verify all benchmarked kernels
express that the number of thread blocks depends on the values of tunable parameters
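Both points can be sketched in plain Python. The example itself uses scipy for the reference answer; to stay dependency-free, the sketch below swaps in a hand-written CSR matrix-vector product. The `threads_per_row` parameter and the mapping from rows to thread blocks are assumptions for illustration and may differ from the actual example.

```python
from math import ceil, isclose

def csr_spmv(row_ptr, col_idx, values, x):
    """Reference y = A @ x for a matrix A stored in CSR format."""
    y = []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y

def verify(reference, result, atol=1e-6):
    """Return True when every output element matches the reference within atol."""
    return len(reference) == len(result) and all(
        isclose(r, o, abs_tol=atol) for r, o in zip(reference, result)
    )

def num_thread_blocks(n_rows, cfg):
    # Assumed mapping: threads_per_row threads cooperate on one row, so a
    # block of block_size_x threads handles block_size_x // threads_per_row rows.
    rows_per_block = cfg["block_size_x"] // cfg["threads_per_row"]
    return ceil(n_rows / rows_per_block)

# The 3 x 3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
reference = csr_spmv(row_ptr, col_idx, values, x)
```

Because the grid size depends on two tunable parameters at once, it cannot be fixed ahead of time. It has to be recomputed for every configuration, and each configuration's output is checked against the same reference.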
Point-in-Polygon¶
- [CUDA]
overlap transfers with device mapped host memory
tune across different implementations of an algorithm
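Treating the choice of implementation as just another tunable parameter can be sketched as follows. This is a plain-Python illustration, not the code from the example: the `method` parameter, the two edge-crossing variants, and the test polygon are hypothetical. In a CUDA kernel the same selection would typically be made at compile time, since Kernel Tuner passes tunable parameters to the kernel as preprocessor definitions.

```python
def crosses_division(px, py, ax, ay, bx, by):
    # Does edge (a, b) cross the horizontal ray from p towards +x?
    # The x coordinate of the intersection is computed with a division.
    if (ay > py) == (by > py):
        return False
    x_at_py = ax + (py - ay) * (bx - ax) / (by - ay)
    return px < x_at_py

def crosses_no_division(px, py, ax, ay, bx, by):
    # Same test, rearranged to avoid the division. Multiplying both sides
    # by (by - ay) flips the inequality when that factor is negative.
    if (ay > py) == (by > py):
        return False
    lhs = (px - ax) * (by - ay)
    rhs = (py - ay) * (bx - ax)
    return lhs < rhs if by > ay else lhs > rhs

# The tunable parameter selects which implementation is used.
IMPLEMENTATIONS = {0: crosses_division, 1: crosses_no_division}
tune_params = {"method": [0, 1]}

def point_in_polygon(point, polygon, method):
    # Crossing-number test: a point is inside when the ray crosses
    # the polygon boundary an odd number of times.
    crosses = IMPLEMENTATIONS[method]
    px, py = point
    inside = False
    for i in range(len(polygon)):
        ax, ay = polygon[i]
        bx, by = polygon[i - 1]
        if crosses(px, py, ax, ay, bx, by):
            inside = not inside
    return inside

square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
```

Every variant must produce the same answer, so the tuner only has to decide which one is fastest for a given device and input.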
ExpDist¶
- [CUDA]
perform an in-thread-block 2D reduction using the CUB library
use C++ in CUDA kernel code
tune multiple kernels in a pipeline