Wisdom Files

In the previous example, we demonstrated how to compile a kernel by providing both a KernelBuilder instance (describing the blueprint for the kernel) and a Config instance (describing the configuration of the tunable parameters).

However, determining the optimal configuration is often challenging, since it depends on both the problem size and the specific GPU being used. To address this problem, Kernel Launcher provides wisdom files (terminology borrowed from FFTW).

To use Kernel Launcher’s wisdom files, we need to run the application twice. The first run captures the kernels that we want to tune; the captured kernels are then tuned using Kernel Tuner. On the second run, the kernel configuration is selected from the wisdom file that was generated during the tuning process.

Let’s see this in action.

Source code

vector_add.cu

 1template <typename T>
 2__global__
 3void vector_add(int n, T* C, const T* A, const T* B) {
 4    for (int k = 0; k < ELEMENTS_PER_THREAD; k++) {
 5        int i = blockIdx.x * ELEMENTS_PER_THREAD * blockDim.x + k * blockDim.x + threadIdx.x;
 6
 7        if (i < n) {
 8            C[i] = A[i] + B[i];
 9        }
10    }
11}
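
Note that each thread processes ELEMENTS_PER_THREAD elements, so a single block covers blockDim.x * ELEMENTS_PER_THREAD consecutive elements. The host-side snippet below replays the kernel's index arithmetic for hypothetical values (block_size=128, ELEMENTS_PER_THREAD=4); it is only an illustration and not part of the example code.

#include <cstdio>

int main() {
    const int block_size = 128;         // blockDim.x (hypothetical value)
    const int elements_per_thread = 4;  // ELEMENTS_PER_THREAD (hypothetical value)
    const int block_idx = 2;            // blockIdx.x, chosen as an example

    // Mirrors the kernel's index computation:
    // i = blockIdx.x * ELEMENTS_PER_THREAD * blockDim.x + k * blockDim.x + threadIdx.x
    for (int k = 0; k < elements_per_thread; k++) {
        int first = block_idx * elements_per_thread * block_size + k * block_size;
        int last  = first + block_size - 1;  // threadIdx.x ranges over [0, block_size)
        std::printf("k=%d: elements [%d, %d]\n", k, first, last);
    }
    return 0;
}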

main.cpp

 1#include "kernel_launcher.h"
 2
 3// Namespace alias.
 4namespace kl = kernel_launcher;
 5
 6kl::KernelBuilder build_kernel() {
 7    kl::KernelBuilder builder("vector_add", "vector_add.cu");
 8
 9    auto threads_per_block = builder.tune("block_size", {32, 64, 128, 256, 512, 1024});
10    auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});
11    auto elements_per_block = threads_per_block * elements_per_thread;
12
13    builder
14        .tuning_key("vector_add_float")
15        .problem_size(kl::arg0)
16        .block_size(threads_per_block)
17        .grid_divisors(elements_per_block)
18        .template_args(kl::type_of<float>())
19        .define("ELEMENTS_PER_THREAD", elements_per_thread);
20
21    return builder;
22}
23
24int main() {
25    kl::set_global_wisdom_directory("wisdom/");
26    kl::set_global_capture_directory("captures/");
27
28    // Define the kernel.
29    kl::KernelBuilder builder = build_kernel();
30    kl::WisdomKernel vector_add_kernel(builder);
31
32    // Initialize CUDA memory. This is outside the scope of kernel_launcher.
33    unsigned int n = 1000000;
34    float *dev_A, *dev_B, *dev_C;
35    /* cudaMalloc, cudaMemcpy, ... */
36
37    // Launch the kernel! The problem size is derived from the first
38    // argument (n), as specified by problem_size(kl::arg0).
39    vector_add_kernel(n, dev_C, dev_A, dev_B);
40    return 0;
41}
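
The CUDA memory management is elided in the listing above, since it is ordinary CUDA runtime code and outside the scope of Kernel Launcher. For completeness, a minimal sketch of what the elided part of main might look like is shown below; the input values are arbitrary example data and error handling is omitted.

#include <vector>
#include <cuda_runtime.h>

// A minimal sketch of the elided setup: allocate device buffers and copy
// the (arbitrary) input data to the GPU. Error handling is omitted.
unsigned int n = 1000000;
std::vector<float> A(n, 1.0f), B(n, 2.0f);

float *dev_A, *dev_B, *dev_C;
cudaMalloc(&dev_A, n * sizeof(float));
cudaMalloc(&dev_B, n * sizeof(float));
cudaMalloc(&dev_C, n * sizeof(float));
cudaMemcpy(dev_A, A.data(), n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_B, B.data(), n * sizeof(float), cudaMemcpyHostToDevice);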

Code Explanation

Notice how this example is similar to the previous one, with a few minor differences; most notably, kl::Kernel has been replaced by kl::WisdomKernel. We now highlight the important lines of this example.

 6kl::KernelBuilder build_kernel() {
 7    kl::KernelBuilder builder("vector_add", "vector_add.cu");
 8
 9    auto threads_per_block = builder.tune("block_size", {32, 64, 128, 256, 512, 1024});
10    auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});
11    auto elements_per_block = threads_per_block * elements_per_thread;
12
13    builder
14        .tuning_key("vector_add_float")
15        .problem_size(kl::arg0)
16        .block_size(threads_per_block)
17        .grid_divisors(elements_per_block)
18        .template_args(kl::type_of<float>())
19        .define("ELEMENTS_PER_THREAD", elements_per_thread);
20
21    return builder;
22}

This function creates the KernelBuilder object: it defines the two tunable parameters (block_size and elements_per_thread) and describes how to instantiate and launch the kernel.
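
As an aside, grid_divisors determines how the grid size is derived from the problem size. Assuming it divides the problem size by the given expression and rounds up (our reading of this example, not an authoritative API description), the arithmetic for the configuration reported in the debug log at the end of this page would be:

#include <cstdio>

int main() {
    // Ceil-divide the problem size by the number of elements per block.
    // block_size=128 and elements_per_thread=4 match the tuned
    // configuration shown in the debug log at the end of this page.
    unsigned int n = 1000000;
    unsigned int block_size = 128, elements_per_thread = 4;
    unsigned int elements_per_block = block_size * elements_per_thread;         // 512
    unsigned int grid_size = (n + elements_per_block - 1) / elements_per_block;
    std::printf("grid size: %u blocks\n", grid_size);                           // prints 1954
    return 0;
}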

13    builder
14        .tuning_key("vector_add_float")

When using a WisdomKernel, we need to set the tuning key. The tuning key is a string that uniquely identifies the kernel and is used to locate the corresponding wisdom file. If no tuning key is specified, the kernel name is used by default (in this case, vector_add). If no wisdom file can be found, the default configuration is used (in this example, block_size=32 and elements_per_thread=1).
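
Since the wisdom file is looked up by tuning key, different instantiations of the same source kernel can be tuned independently by giving each one its own key. For example, a double-precision variant (a hypothetical extension, not part of the code above) could be set up as follows:

// Hypothetical double-precision variant: same kernel source, but its own
// tuning key, so its tuning results would be stored in a separate wisdom
// file (wisdom/vector_add_double.wisdom).
kl::KernelBuilder builder("vector_add", "vector_add.cu");
// ... same tunable parameters and launch configuration as before ...
builder
    .tuning_key("vector_add_double")
    .template_args(kl::type_of<double>());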

25    kl::set_global_wisdom_directory("wisdom/");
26    kl::set_global_capture_directory("captures/");

These two lines set global configuration options for the application.

The function set_global_wisdom_directory sets the directory where Kernel Launcher will search for wisdom files associated with a compiled kernel. In this example, the directory wisdom/ is set as the wisdom directory, and Kernel Launcher will search for the file wisdom/vector_add_float.wisdom since vector_add_float is the tuning key.

The function set_global_capture_directory sets the directory where Kernel Launcher will store resulting files when capturing a kernel launch.

28    // Define the kernel.
29    kl::KernelBuilder builder = build_kernel();
30    kl::WisdomKernel vector_add_kernel(builder);

These lines construct the KernelBuilder and pass it on to the WisdomKernel. Note that the kernel is not compiled at this point; compilation happens when the kernel is called for the first time, after the wisdom file has been consulted.

Export the kernel

In order to tune the kernel, the first step is to capture the kernel launch. To do so, we need to run the program with the environment variable KERNEL_LAUNCHER_CAPTURE set to the name of the kernel we want to capture:

$ KERNEL_LAUNCHER_CAPTURE=vector_add ./main

This generates a file called vector_add_1000000.json (the kernel name followed by the problem size) in the directory set by set_global_capture_directory.

Alternatively, it is possible to capture several kernels at once by using the wildcard *. For example, the following command captures all kernels whose names start with vector_:

$ KERNEL_LAUNCHER_CAPTURE=vector_* ./main

See Environment Variables for an overview and description of additional environment variables.

Tune the kernel

To tune the kernel, run the Python script tune.py in the python/ directory; this script uses Kernel Tuner to tune the captured kernels. To view all available options, use --help. For example, to spend ten minutes tuning the kernel for the current GPU, use the following command:

$ python tune.py captures/vector_add_1000000.json --output wisdom/ --time 10:00

To tune multiple kernels at once, use a wildcard:

$ python tune.py captures/*.json --output wisdom/

If everything goes well, the script should run for ten minutes and eventually generate a file wisdom/vector_add_float.wisdom containing the tuning results. Note that it is possible to tune the same kernel for different GPUs and problem sizes, and all results will be saved in the same wisdom file. After tuning, the files in the captures/ directory can be removed safely.

Import the wisdom

To use the wisdom file, make sure that wisdom/vector_add_float.wisdom is available and simply rerun the application. Now, when the program calls vector_add_kernel, Kernel Launcher finds the wisdom file and compiles the kernel with the optimal configuration. You can verify that the wisdom file has been found by setting the environment variable KERNEL_LAUNCHER_LOG=debug and checking the debugging output:

$ KERNEL_LAUNCHER_LOG=debug ./main

KERNEL_LAUNCHER [DEBUG] reading wisdom file wisdom/vector_add_float.wisdom
KERNEL_LAUNCHER [DEBUG] found configuration for kernel vector_add, device NVIDIA A100-PCIE-40GB, problem size (1000000, 1, 1): {"block_size": 128, "elements_per_thread": 4}
KERNEL_LAUNCHER [DEBUG] compiling kernel (vector_add.cu)