Basic Example

On this page, we show a basic example of how to use Kernel Launcher. We first show the full source code and then go over the example line by line.

Source Code

vector_add_kernel.cu

 1template <typename T>
 2__global__
 3void vector_add(int n, T* C, const T* A, const T* B) {
 4    for (int k = 0; k < ELEMENTS_PER_THREAD; k++) {
 5        int i = blockIdx.x * ELEMENTS_PER_THREAD * blockDim.x + k * blockDim.x + threadIdx.x;
 6
 7        if (i < n) {
 8            C[i] = A[i] + B[i];
 9        }
10    }
11}

main.cpp

 1#include "kernel_launcher.h"
 2
 3
 4int main() {
 5    // Namespace alias.
 6    namespace kl = kernel_launcher;
 7    
 8    // Create a kernel builder
 9    kl::KernelBuilder builder("vector_add", "vector_add_kernel.cu");
10    
11    // Define tunable parameters 
12    auto threads_per_block = builder.tune("block_size", {32, 64, 128, 256, 512, 1024});
13    auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});
14    
15    // Define expressions
16    auto elements_per_block = threads_per_block * elements_per_thread;
17    
18    // Define kernel properties
19    builder
20        .problem_size(kl::arg0)
21        .block_size(threads_per_block)
22        .grid_divisors(threads_per_block * elements_per_thread)
23        .template_args(kl::type_of<float>())
24        .define("ELEMENTS_PER_THREAD", elements_per_thread);
25
26    // Define configuration
27    kl::Config config;
28    config.insert(threads_per_block, 32);
29    config.insert(elements_per_thread, 2);
30
31    // Compile kernel
32    kl::Kernel<int, float*, const float*, const float*> vector_add_kernel;
33    vector_add_kernel.compile(builder, config);
34    
35    // Initialize CUDA memory. This is outside the scope of kernel_launcher.
36    int n = 1000000;
37    float *dev_A, *dev_B, *dev_C;
38    /* cudaMalloc, cudaMemcpy, ... */
39        
40    // Launch the kernel!
41    vector_add_kernel.launch(n, dev_C, dev_A, dev_B);
42}

Code Explanation

8    // Create a kernel builder
9    kl::KernelBuilder builder("vector_add", "vector_add_kernel.cu");

First, we need to define a KernelBuilder instance. A KernelBuilder is essentially a blueprint that describes the information required to compile the CUDA kernel. The constructor takes the name of the kernel function and the .cu file where the code is located. Optionally, we can also provide the kernel source as the third parameter.
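
For example, a minimal sketch of providing the source inline, assuming the optional third-parameter overload described above (the shortened source string is purely illustrative):

// Sketch: pass the kernel source as a string via the optional third
// constructor argument; the file name then mainly serves as a label.
std::string source =
    "template <typename T>\n"
    "__global__ void vector_add(int n, T* C, const T* A, const T* B) {\n"
    "    /* ... kernel body as shown above ... */\n"
    "}\n";

kl::KernelBuilder builder("vector_add", "vector_add_kernel.cu", source);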

11    // Define tunable parameters 
12    auto threads_per_block = builder.tune("block_size", {32, 64, 128, 256, 512, 1024});
13    auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});

CUDA kernels often have tunable parameters that can impact their performance, such as block size, thread granularity, register usage, and the use of shared memory. Here, we define two tunable parameters: the number of threads per block and the number of elements processed per thread.
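
Each call to tune registers a parameter name together with its set of candidate values and returns a placeholder for that parameter. As a purely illustrative sketch (this parameter is not part of the example above and is not used further), an additional parameter would be registered in exactly the same way:

// Hypothetical extra tunable parameter, shown only to illustrate the pattern.
auto use_fast_path = builder.tune("use_fast_path", {0, 1});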

15    // Define expressions
16    auto elements_per_block = threads_per_block * elements_per_thread;

The values returned by tune are placeholder objects. These objects can be combined using C++ operators to create new expression objects. Note that elements_per_block does not actually hold a specific value; instead, it is an abstract expression that, upon kernel instantiation, evaluates to the product of threads_per_block and elements_per_thread.
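
Since elements_per_block is equivalent to the product used below, it could, for instance, be passed to grid_divisors directly (a sketch; the listing above spells the product out instead):

// Equivalent to .grid_divisors(threads_per_block * elements_per_thread)
builder.grid_divisors(elements_per_block);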

18    // Define kernel properties
19    builder
20        .problem_size(kl::arg0)
21        .block_size(threads_per_block)
22        .grid_divisors(threads_per_block * elements_per_thread)
23        .template_args(kl::type_of<float>())
24        .define("ELEMENTS_PER_THREAD", elements_per_thread);

Next, we define properties of the kernel, such as the block size and the template arguments. These properties can take expressions as values, as demonstrated above. The full list of properties is documented in api/KernelBuilder. The following properties are supported (a short sketch combining a few of them follows the list):

  • problem_size: An N-dimensional vector that represents the size of the problem. In this case, the problem is one-dimensional and kl::arg0 indicates that the size is given by the first kernel argument (argument 0).

  • block_size: A triplet (x, y, z) representing the block dimensions.

  • grid_divisors: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors (rounding up) to obtain the grid size. In other words, this property expresses the number of elements processed per thread block. For example, if block_size is 32 and elements_per_thread is 2, the divisor is 64, so a problem size of 1000000 results in a grid of 1000000 / 64 = 15625 blocks.

  • template_args: This property specifies template arguments, which can be type names and integral values.

  • define: Define preprocessor constants.

  • shared_memory: Specify the amount of shared memory required, in bytes.

  • compiler_flags: Additional flags passed to the compiler.
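
A short sketch combining a few of these properties (the shared-memory size and compiler flag below are illustrative values, not part of the example above):

// Illustrative only: reserve 4096 bytes of shared memory and enable fast math.
builder
    .shared_memory(4096)
    .compiler_flags("--use_fast_math");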

26    // Define configuration
27    kl::Config config;
28    config.insert(threads_per_block, 32);
29    config.insert(elements_per_thread, 2);

The configuration defines the values of the tunable parameters to be used for compilation. Here, the Config instance is constructed manually, but it could also be loaded from a file or a tuning database.

31    // Compile kernel
32    kl::Kernel<int, float*, const float*, const float*> vector_add_kernel;
33    vector_add_kernel.compile(builder, config);

Compiling a Kernel requires a KernelBuilder together with a Config. The Kernel instance should be kept around, for example as a member of a class, and compiled only once during initialization so that it can be launched many times afterwards.
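
A minimal sketch of that pattern (the wrapper class and its names are illustrative, not part of the library):

// Compile once during initialization, launch many times afterwards.
struct VectorAdd {
    kl::Kernel<int, float*, const float*, const float*> kernel;

    VectorAdd(const kl::KernelBuilder& builder, const kl::Config& config) {
        kernel.compile(builder, config);
    }

    void operator()(int n, float* C, const float* A, const float* B) {
        kernel.launch(n, C, A, B);
    }
};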

39        
40    // Launch the kernel!
41    vector_add_kernel.launch(n, dev_C, dev_A, dev_B);
42}

To launch the kernel, we simply call launch.

Alternatively, it is also possible to use the short-hand form:

1// Launch the kernel!
2vector_add_kernel(n, dev_C, dev_A, dev_B);

To launch the kernel on a specific CUDA stream, pass the stream as the first argument:

1// Launch the kernel!
2vector_add_kernel(stream, n, dev_C, dev_A, dev_B);

For 2D or 3D problems, we must configure the KernelBuilder with additional dimensions:

1// Define kernel properties
2builder.problem_size(kl::arg0, kl::arg1, 100);
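
The block size and grid divisors can likewise be given one value per dimension. A sketch for a 2D problem (the concrete numbers are illustrative):

// 16x16 threads per block; each block covers a 16x16 tile of the problem.
builder
    .problem_size(kl::arg0, kl::arg1)
    .block_size(16, 16)
    .grid_divisors(16, 16);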