Kernel Registry
In the previous example, we saw how to use wisdom files by creating a WisdomKernel object.
This object compiles the kernel code on the first call and then keeps the kernel loaded for as long as the object exists.
Typically, one would define the WisdomKernel object as a class member or as a global variable.
However, in certain scenarios, it is inconvenient or impractical to store WisdomKernel objects.
In these cases, it is possible to use the KernelRegistry, which essentially acts as a global table of compiled kernel instances.
Source code
Consider the following code snippet:
#include "kernel_launcher.h"

// Namespace alias.
namespace kl = kernel_launcher;

// Public inheritance, so that the descriptor can be passed as an IKernelDescriptor.
class VectorAddDescriptor: public kl::IKernelDescriptor {
public:
    template <typename T>
    static VectorAddDescriptor for_type() {
        return VectorAddDescriptor(kl::type_of<T>());
    }

    VectorAddDescriptor(kl::TypeInfo t): element_type(t) {}

    kl::KernelBuilder build() const override {
        kl::KernelBuilder builder("vector_add", "vector_add.cu");

        auto threads_per_block = builder.tune("block_size", {32, 64, 128, 256, 512, 1024});
        auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});
        auto elements_per_block = threads_per_block * elements_per_thread;

        builder
            .tuning_key("vector_add_" + this->element_type.name())
            .problem_size(kl::arg0)
            .block_size(threads_per_block)
            .grid_divisors(elements_per_block)
            .template_args(element_type)
            .define("ELEMENTS_PER_THREAD", elements_per_thread);

        return builder;
    }

    bool equals(const IKernelDescriptor& other) const override {
        if (auto p = dynamic_cast<const VectorAddDescriptor*>(&other)) {
            return this->element_type == p->element_type;
        }

        return false;
    }

private:
    kl::TypeInfo element_type;
};

int main() {
    kl::set_global_wisdom_directory("wisdom/");
    kl::set_global_capture_directory("captures/");

    // Initialize CUDA memory. This is outside the scope of kernel_launcher.
    unsigned int n = 1000000;
    float *dev_A, *dev_B, *dev_C;
    /* cudaMalloc, cudaMemcpy, ... */

    // Launch the kernel!
    kl::default_registry()
        .lookup(VectorAddDescriptor::for_type<float>())
        .launch(n, dev_C, dev_A, dev_B);

    // Or use the short equivalent syntax:
    kl::launch(VectorAddDescriptor::for_type<float>(), n, dev_C, dev_A, dev_B);

    return 0;
}
Code Explanation
The code example consists of two parts.
In the first part, a class VectorAddDescriptor is defined.
In the second part, an instance of this class is looked up in the global kernel registry.
Defining a kernel descriptor
// Public inheritance, so that the descriptor can be passed as an IKernelDescriptor.
class VectorAddDescriptor: public kl::IKernelDescriptor {
public:
    template <typename T>
    static VectorAddDescriptor for_type() {
        return VectorAddDescriptor(kl::type_of<T>());
    }

    VectorAddDescriptor(kl::TypeInfo t): element_type(t) {}

    kl::KernelBuilder build() const override {
        kl::KernelBuilder builder("vector_add", "vector_add.cu");

        auto threads_per_block = builder.tune("block_size", {32, 64, 128, 256, 512, 1024});
        auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});
        auto elements_per_block = threads_per_block * elements_per_thread;

        builder
            .tuning_key("vector_add_" + this->element_type.name())
            .problem_size(kl::arg0)
            .block_size(threads_per_block)
            .grid_divisors(elements_per_block)
            .template_args(element_type)
            .define("ELEMENTS_PER_THREAD", elements_per_thread);

        return builder;
    }

    bool equals(const IKernelDescriptor& other) const override {
        if (auto p = dynamic_cast<const VectorAddDescriptor*>(&other)) {
            return this->element_type == p->element_type;
        }

        return false;
    }

private:
    kl::TypeInfo element_type;
};
This part of the code defines an IKernelDescriptor: a class that encapsulates the information required to compile a kernel.
This class should override two methods: build, to instantiate a KernelBuilder, and equals, to check for equality with another IKernelDescriptor.
The equals method is required since a kernel registry is essentially a hash table that maps IKernelDescriptor objects to kernel objects.
It is used to check whether two descriptors (i.e., keys in the hash table) are equivalent.
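The equality check can be tried out in isolation. The following is a minimal sketch that does not use kernel_launcher: Descriptor, VectorAdd, and VectorMul are hypothetical stand-ins, and the element type is stored as a plain string instead of a kl::TypeInfo. It only illustrates the dynamic_cast pattern used by equals, in which descriptors of different classes never compare equal, even when their fields match.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for kl::IKernelDescriptor (equality only).
struct Descriptor {
    virtual ~Descriptor() = default;
    virtual bool equals(const Descriptor& other) const = 0;
};

// Hypothetical descriptor that is identified by its element type name.
class VectorAdd : public Descriptor {
public:
    explicit VectorAdd(std::string t) : element_type(std::move(t)) {}

    bool equals(const Descriptor& other) const override {
        // dynamic_cast yields nullptr if `other` is not a VectorAdd.
        if (auto p = dynamic_cast<const VectorAdd*>(&other)) {
            return element_type == p->element_type;
        }
        return false;
    }

private:
    std::string element_type;
};

// A second hypothetical descriptor class holding the same kind of field.
class VectorMul : public Descriptor {
public:
    explicit VectorMul(std::string t) : element_type(std::move(t)) {}

    bool equals(const Descriptor& other) const override {
        if (auto p = dynamic_cast<const VectorMul*>(&other)) {
            return element_type == p->element_type;
        }
        return false;
    }

private:
    std::string element_type;
};
```

With these definitions, VectorAdd("float") equals another VectorAdd("float"), but it equals neither VectorAdd("int") nor VectorMul("float").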
Using the KernelRegistry
// Launch the kernel!
kl::default_registry()
    .lookup(VectorAddDescriptor::for_type<float>())
    .launch(n, dev_C, dev_A, dev_B);
Here, the vector-add kernel is looked up in the registry and launched with the given arguments. Note that this code can be called multiple times from different functions of a program: the kernel is compiled only once, on the first lookup, and is then stored in the registry for later calls.
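The compile-once behavior can be pictured with a toy model. The sketch below is not kernel_launcher's implementation: ToyKernel, ToyDescriptor, ToyRegistry, and toy_default_registry are hypothetical names, "building" a kernel merely copies a string, and a linear scan replaces the hash table for brevity. It shows how a lookup with an equal descriptor returns the cached entry instead of building again.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a compiled kernel.
struct ToyKernel {
    std::string name;
};

// Hypothetical stand-in for a kernel descriptor.
struct ToyDescriptor {
    std::string key;

    bool equals(const ToyDescriptor& other) const { return key == other.key; }

    // Stands in for the expensive compilation step.
    ToyKernel build() const { return ToyKernel{key}; }
};

class ToyRegistry {
public:
    // Returns the cached kernel for an equal descriptor, building it on first use.
    ToyKernel& lookup(const ToyDescriptor& d) {
        for (auto& entry : entries_) {
            if (entry.first.equals(d)) {
                return *entry.second;
            }
        }

        ++builds_;
        entries_.emplace_back(d, std::make_unique<ToyKernel>(d.build()));
        return *entries_.back().second;
    }

    // Number of times build() was actually invoked.
    int builds() const { return builds_; }

private:
    // unique_ptr keeps references to kernels valid when the vector grows.
    std::vector<std::pair<ToyDescriptor, std::unique_ptr<ToyKernel>>> entries_;
    int builds_ = 0;
};

// Hypothetical analogue of a process-wide default registry.
inline ToyRegistry& toy_default_registry() {
    static ToyRegistry registry;
    return registry;
}
```

Looking up ToyDescriptor{"vector_add_float"} twice in the same registry returns the same ToyKernel object and increments builds() only once, while a descriptor with a different key triggers a second build.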
// Or use the short equivalent syntax:
kl::launch(VectorAddDescriptor::for_type<float>(), n, dev_C, dev_A, dev_B);
Alternatively, it is possible to use the above shorthand syntax.
This syntax also makes it easy to replace the element type float with some other type such as int:
kl::launch(VectorAddDescriptor::for_type<int>(), n, dev_C, dev_A, dev_B);
It is even possible to define a templated function that passes the type T on to VectorAddDescriptor, for some extra template magic:
template <typename T>
void launch_vector_add(unsigned int n, T* C, const T* A, const T* B) {
    kl::launch(VectorAddDescriptor::for_type<T>(), n, C, A, B);
}
Instead of using the global kernel registry, it is also possible to create a local registry by creating a KernelRegistry instance.