Kernel Registry

In the previous example, we saw how to use wisdom files by creating a WisdomKernel object. This object will compile the kernel code on the first call and then keep the kernel loaded as long as the object exists. Typically, one would define the WisdomKernel object as part of a class or as a global variable.

However, in some scenarios it is inconvenient or impractical to store WisdomKernel objects. In these cases, it is possible to use the KernelRegistry, which essentially acts as a global table of compiled kernel instances.

Source code

Consider the following code snippet:

#include "kernel_launcher.h"

// Namespace alias.
namespace kl = kernel_launcher;

// Public inheritance, so the descriptor can be passed as an IKernelDescriptor.
class VectorAddDescriptor: public kl::IKernelDescriptor {
public:
    template <typename T>
    static VectorAddDescriptor for_type() {
        return VectorAddDescriptor(kl::type_of<T>());
    }

    VectorAddDescriptor(kl::TypeInfo t): element_type(t) {}

    kl::KernelBuilder build() const override {
        kl::KernelBuilder builder("vector_add", "vector_add.cu");

        auto threads_per_block = builder.tune("block_size", {32, 64, 128, 256, 512, 1024});
        auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});
        auto elements_per_block = threads_per_block * elements_per_thread;

        builder
            .tuning_key("vector_add_" + this->element_type.name())
            .problem_size(kl::arg0)
            .block_size(threads_per_block)
            .grid_divisors(elements_per_block)
            .template_args(element_type)
            .define("ELEMENTS_PER_THREAD", elements_per_thread);

        return builder;
    }

    bool equals(const IKernelDescriptor& other) const override {
        if (auto p = dynamic_cast<const VectorAddDescriptor*>(&other)) {
            return this->element_type == p->element_type;
        }

        return false;
    }

private:
    kl::TypeInfo element_type;
};

int main() {
    kl::set_global_wisdom_directory("wisdom/");
    kl::set_global_capture_directory("captures/");

    // Initialize CUDA memory. This is outside the scope of kernel_launcher.
    unsigned int n = 1000000;
    float *dev_A, *dev_B, *dev_C;
    /* cudaMalloc, cudaMemcpy, ... */

    // Launch the kernel!
    kl::default_registry()
        .lookup(VectorAddDescriptor::for_type<float>())
        .launch(n, dev_C, dev_A, dev_B);

    // Or use the short equivalent syntax:
    kl::launch(VectorAddDescriptor::for_type<float>(), n, dev_C, dev_A, dev_B);

    return 0;
}

Code Explanation

The code example consists of two parts. The first part defines a class VectorAddDescriptor. The second part uses an instance of this class to look up the kernel in the global kernel registry.

Defining a kernel descriptor

// Public inheritance, so the descriptor can be passed as an IKernelDescriptor.
class VectorAddDescriptor: public kl::IKernelDescriptor {
public:
    template <typename T>
    static VectorAddDescriptor for_type() {
        return VectorAddDescriptor(kl::type_of<T>());
    }

    VectorAddDescriptor(kl::TypeInfo t): element_type(t) {}

    kl::KernelBuilder build() const override {
        kl::KernelBuilder builder("vector_add", "vector_add.cu");

        auto threads_per_block = builder.tune("block_size", {32, 64, 128, 256, 512, 1024});
        auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});
        auto elements_per_block = threads_per_block * elements_per_thread;

        builder
            .tuning_key("vector_add_" + this->element_type.name())
            .problem_size(kl::arg0)
            .block_size(threads_per_block)
            .grid_divisors(elements_per_block)
            .template_args(element_type)
            .define("ELEMENTS_PER_THREAD", elements_per_thread);

        return builder;
    }

    bool equals(const IKernelDescriptor& other) const override {
        if (auto p = dynamic_cast<const VectorAddDescriptor*>(&other)) {
            return this->element_type == p->element_type;
        }

        return false;
    }

private:
    kl::TypeInfo element_type;
};

This part of the code defines an IKernelDescriptor: a class that encapsulates the information required to compile a kernel. This class should override two methods:

  • build to instantiate a KernelBuilder,

  • equals to check for equality with another IKernelDescriptor.

The equals method is required because a kernel registry is essentially a hash table that maps IKernelDescriptor objects to kernel objects. It is used to check whether two descriptors (i.e., keys in the hash table) are equivalent.
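To see why equals compares both the dynamic type and the element type, consider the following minimal sketch. It does not use kernel_launcher: Descriptor, VectorAdd, and VectorScale are hypothetical stand-ins, and element types are represented as plain strings instead of kl::TypeInfo.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for kl::IKernelDescriptor (not the library's class).
struct Descriptor {
    virtual ~Descriptor() = default;
    virtual bool equals(const Descriptor& other) const = 0;
};

// Descriptor for a vector-add kernel, keyed by its element type.
struct VectorAdd : Descriptor {
    std::string element_type;

    explicit VectorAdd(std::string t) : element_type(std::move(t)) {}

    bool equals(const Descriptor& other) const override {
        // The cast yields nullptr when `other` is a different descriptor class.
        if (auto p = dynamic_cast<const VectorAdd*>(&other)) {
            return element_type == p->element_type;
        }
        return false;
    }
};

// A second, unrelated descriptor class that happens to have the same field.
struct VectorScale : Descriptor {
    std::string element_type;

    explicit VectorScale(std::string t) : element_type(std::move(t)) {}

    bool equals(const Descriptor& other) const override {
        if (auto p = dynamic_cast<const VectorScale*>(&other)) {
            return element_type == p->element_type;
        }
        return false;
    }
};
```

In this sketch, VectorAdd("float") equals another VectorAdd("float"), but not VectorAdd("int") and not VectorScale("float"). Each of these would therefore occupy a separate entry in a registry.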

Using the KernelRegistry

    // Launch the kernel!
    kl::default_registry()
        .lookup(VectorAddDescriptor::for_type<float>())
        .launch(n, dev_C, dev_A, dev_B);

Here, the vector-add kernel is looked up in the registry and launched with the given arguments. Note that this code can be called multiple times from different functions of a program, but the kernel is compiled only once and then stored in the registry.
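The compile-once behavior can be sketched with a small mock registry. This is an illustration under stated assumptions, not the implementation of kl::KernelRegistry: the names Registry, Kernel, Descriptor, VectorAdd, step_a, and step_b are hypothetical, compilation is simulated by a counter, and a linear scan using equals replaces the hash-table lookup for brevity.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical compiled kernel; launching only counts the calls.
struct Kernel {
    std::string name;
    int launches = 0;

    explicit Kernel(std::string n) : name(std::move(n)) {}
    void launch() { ++launches; }
};

// Hypothetical stand-in for kl::IKernelDescriptor.
struct Descriptor {
    virtual ~Descriptor() = default;
    virtual Kernel build() const = 0;
    virtual bool equals(const Descriptor& other) const = 0;
};

struct VectorAdd : Descriptor {
    std::string element_type;

    explicit VectorAdd(std::string t) : element_type(std::move(t)) {}

    Kernel build() const override {
        return Kernel("vector_add_" + element_type);
    }

    bool equals(const Descriptor& other) const override {
        if (auto p = dynamic_cast<const VectorAdd*>(&other)) {
            return element_type == p->element_type;
        }
        return false;
    }
};

// Mock registry: builds a kernel on the first lookup and reuses it afterwards.
class Registry {
    struct Entry {
        std::unique_ptr<Descriptor> key;
        std::unique_ptr<Kernel> kernel;
    };

public:
    template <typename D>
    Kernel& lookup(const D& key) {
        for (auto& entry : entries_) {
            if (entry.key->equals(key)) {
                return *entry.kernel;  // Already built: reuse it.
            }
        }

        ++builds_;  // First request for this descriptor: "compile" it.
        entries_.push_back(Entry{std::make_unique<D>(key),
                                 std::make_unique<Kernel>(key.build())});
        return *entries_.back().kernel;
    }

    int builds() const { return builds_; }

private:
    std::vector<Entry> entries_;
    int builds_ = 0;
};

// Two unrelated call sites that request the same kernel.
void step_a(Registry& r) { r.lookup(VectorAdd("float")).launch(); }
void step_b(Registry& r) { r.lookup(VectorAdd("float")).launch(); }
```

Calling step_a and step_b on the same Registry triggers a single build, because the second lookup finds an equal descriptor already present. A lookup with VectorAdd("int") triggers a second build.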

    // Or use the short equivalent syntax:
    kl::launch(VectorAddDescriptor::for_type<float>(), n, dev_C, dev_A, dev_B);

Alternatively, it is possible to use the above short-hand syntax. This syntax also makes it easy to replace the element type float with some other type such as int:

kl::launch(VectorAddDescriptor::for_type<int>(), n, dev_C, dev_A, dev_B);

It is even possible to define a templated function that passes type T on to VectorAddDescriptor, for some extra template magic:

template <typename T>
void launch_vector_add(unsigned int n, T* C, const T* A, const T* B) {
    // n is passed in explicitly; it is the problem size (kl::arg0).
    kl::launch(VectorAddDescriptor::for_type<T>(), n, C, A, B);
}

Instead of using the global kernel registry, it is also possible to create a local registry by creating a KernelRegistry instance.
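The difference between a global and a local registry can be sketched as follows. The sketch assumes, as the name suggests, that every registry instance keeps its own table of kernels; it does not use kernel_launcher, and Registry and default_registry here are hypothetical mocks keyed by plain strings.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical registry keyed by a string; stands in for kl::KernelRegistry.
class Registry {
public:
    // Returns true when the key was not cached yet and had to be "compiled".
    bool lookup(const std::string& key) {
        return cache_.insert(key).second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::set<std::string> cache_;
};

// Process-wide instance, analogous in spirit to kl::default_registry().
Registry& default_registry() {
    static Registry instance;
    return instance;
}
```

Under this assumption, a kernel cached in the global registry is not visible to a locally constructed Registry, which compiles its own copy on first lookup and releases it when the local instance goes out of scope.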