Kernel Launcher


Kernel Launcher is a C++ library that dynamically compiles CUDA kernels at runtime (using NVRTC) and launches them in a type-safe manner using C++ templates. Runtime compilation offers two significant advantages:

  • Kernels with tunable parameters (block size, elements per thread, loop unroll factors, etc.) can be compiled with the configuration that is optimal for dynamic factors such as the GPU type and problem size.

  • Performance can be improved by injecting runtime values as compile-time constants into the kernel code (dimensions, array strides, weights, etc.).
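The second point can be sketched with plain string building: runtime values become compile-time constants by forwarding `-D` definitions to the runtime compiler (for NVRTC, as options to `nvrtcCompileProgram`). The helper below is our illustration, not part of Kernel Launcher's API:

```cpp
#include <string>
#include <vector>

// Illustrative sketch (not Kernel Launcher's API): bake runtime values into
// the kernel as compile-time constants by building -D options that would be
// passed to a runtime compiler such as NVRTC via nvrtcCompileProgram.
std::vector<std::string> build_compile_options(int n, int stride) {
    return {
        "-DPROBLEM_SIZE=" + std::to_string(n),
        "-DSTRIDE=" + std::to_string(stride),
    };
}
```

Inside the kernel, `PROBLEM_SIZE` and `STRIDE` then appear as constants that the compiler can fold, unroll against, or use to eliminate bounds checks.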

Kernel Tuner Integration

Kernel Launcher and Kernel Tuner integration

The tight integration of Kernel Launcher with Kernel Tuner ensures that kernels are highly optimized, as illustrated in the image above. Kernel Launcher can capture kernel launches within your application at runtime. These captured kernels can then be tuned by Kernel Tuner, and the tuning results are saved as wisdom files. At runtime, Kernel Launcher uses these wisdom files to compile the kernel with the tuned configuration.

See Wisdom Files for an example of how this works in practice.

Basic Example

This section presents a simple code example illustrating how to use the Kernel Launcher. For a more detailed example, refer to Guides.

Consider the following CUDA kernel for vector addition. This kernel has a template parameter T and a tunable parameter ELEMENTS_PER_THREAD.

```cuda
template <typename T>
__global__
void vector_add(int n, T* C, const T* A, const T* B) {
    for (int k = 0; k < ELEMENTS_PER_THREAD; k++) {
        int i = blockIdx.x * ELEMENTS_PER_THREAD * blockDim.x + k * blockDim.x + threadIdx.x;

        if (i < n) {
            C[i] = A[i] + B[i];
        }
    }
}
```
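Note that `ELEMENTS_PER_THREAD` is not declared in the kernel source itself; it is injected as a preprocessor definition at compile time. To see how the indexing covers the input, here is a CPU emulation of the same loop structure (the function and parameter names are ours, for illustration only):

```cpp
#include <vector>

// CPU emulation of vector_add's indexing: each block of block_dim threads
// processes ELEMENTS_PER_THREAD chunks of block_dim consecutive elements,
// so one block covers ELEMENTS_PER_THREAD * block_dim elements in total.
constexpr int ELEMENTS_PER_THREAD = 4;

void vector_add_cpu(int n, float* C, const float* A, const float* B,
                    int grid_dim, int block_dim) {
    for (int block = 0; block < grid_dim; block++) {
        for (int thread = 0; thread < block_dim; thread++) {
            for (int k = 0; k < ELEMENTS_PER_THREAD; k++) {
                int i = block * ELEMENTS_PER_THREAD * block_dim
                      + k * block_dim + thread;
                if (i < n) {
                    C[i] = A[i] + B[i];
                }
            }
        }
    }
}
```

Consecutive threads within a chunk access consecutive elements, so the GPU version of this access pattern is coalesced.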

The following C++ snippet demonstrates how to use the Kernel Launcher in the host code:

```cpp
#include "kernel_launcher.h"

int main() {
    // Namespace alias.
    namespace kl = kernel_launcher;

    // Create a kernel builder.
    kl::KernelBuilder builder("vector_add", "vector_add_kernel.cu");

    // Define the variables that can be tuned for this kernel.
    kl::ParamExpr threads_per_block = builder.tune("block_size", {32, 64, 128, 256, 512, 1024});
    kl::ParamExpr elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});

    // Set kernel properties such as block size, grid divisor, template arguments, etc.
    builder
        .problem_size(kl::arg0)
        .block_size(threads_per_block)
        .grid_divisors(threads_per_block * elements_per_thread)
        .template_args(kl::type_of<float>())
        .define("ELEMENTS_PER_THREAD", elements_per_thread);

    // Define the kernel.
    kl::WisdomKernel vector_add_kernel(builder);

    // Initialize CUDA memory. This is outside the scope of Kernel Launcher.
    unsigned int n = 1000000;
    float *dev_A, *dev_B, *dev_C;
    /* cudaMalloc, cudaMemcpy, ... */

    // Launch the kernel! Note that the kernel is compiled on the first call.
    // The grid size and block size do not need to be specified; they are
    // derived from the kernel specifications.
    vector_add_kernel(n, dev_C, dev_A, dev_B);
}
```
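As we read it, `problem_size(kl::arg0)` ties the problem size to the first kernel argument (`n`), and `grid_divisors(threads_per_block * elements_per_thread)` tells Kernel Launcher to derive the grid size by dividing the problem size by that product, rounding up. A minimal sketch of that arithmetic (the helper name is ours):

```cpp
// Sketch of the grid-size arithmetic implied by grid_divisors: a ceiling
// division of the problem size by the divisor (helper name is ours).
unsigned int derived_grid_size(unsigned int problem_size, unsigned int divisor) {
    return (problem_size + divisor - 1) / divisor;
}
```

For `n = 1000000` with `block_size = 256` and `elements_per_thread = 4`, each block covers 1024 elements, so 977 blocks are launched.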
