Pragma Kernels
In the previous examples, we demonstrated how a tunable kernel can be specified by defining a KernelBuilder
instance in the host-side code.
While this API offers flexibility, it can be cumbersome and requires keeping the kernel code in CUDA in sync with the host-side code in C++.
Kernel Launcher also provides a way to define kernel specifications directly in the CUDA code by using pragma directives to annotate the kernel code.
Although this method is less flexible than the KernelBuilder
API, it is much more convenient and suitable for most CUDA kernels.
Source Code
The following code example shows valid CUDA kernel code containing pragma directives.
The #pragma
annotations will be ignored by the nvcc
compiler (but they may produce compiler warnings).
1#pragma kernel tune(threads_per_block=32, 64, 128, 256, 512, 1024)
2#pragma kernel tune(items_per_thread=1, 2, 4, 8)
3#pragma kernel set(items_per_block=threads_per_block * items_per_thread)
4#pragma kernel problem_size(n)
5#pragma kernel block_size(threads_per_block)
6#pragma kernel grid_divisor(items_per_block)
7#pragma kernel buffers(C[n], A[n], B[n])
8#pragma kernel tuning_key("vector_add_" + T)
9template <typename T, int items_per_thread=1>
10__global__
11void vector_add(int n, T* C, const T* A, const T* B) {
12 for (int k = 0; k < items_per_thread; k++) {
13 int i = blockIdx.x * items_per_thread * blockDim.x + k * blockDim.x + threadIdx.x;
14
15 if (i < n) {
16 C[i] = A[i] + B[i];
17 }
18 }
19}
Code Explanation
The kernel contains the following pragma
directives:
1#pragma kernel tune(threads_per_block=32, 64, 128, 256, 512, 1024)
2#pragma kernel tune(items_per_thread=1, 2, 4, 8)
The tune directives specify the tunable parameters: threads_per_block
and items_per_thread
.
Since items_per_thread
is also the name of the template parameter, it is passed to the kernel as a compile-time constant via this parameter.
The value of threads_per_block
is not passed to the kernel but is used by subsequent pragmas.
3#pragma kernel set(items_per_block=threads_per_block * items_per_thread)
The set
directives defines a constant.
In this case, the constant items_per_block
is defined as the product of threads_per_block
and items_per_thread
.
4#pragma kernel problem_size(n)
5#pragma kernel block_size(threads_per_block)
6#pragma kernel grid_divisor(items_per_block)
The problem_size
directive defines the problem size (as discussed in Basic Example), block_size
specifies the thread block size, and grid_divisor
specifies how the problem size should be divided to obtain the thread grid size.
Alternatively, grid_size
can be used to specify the grid size directly.
7#pragma kernel buffers(C[n], A[n], B[n])
The buffers
directive specifies the size of each buffer (A
, B
, and C
) as n
elements to be known by Kernel Launcher.
This is necessary since raw pointers can be used for buffer arguments, for which size information may not be available.
If the buffers
pragma is not specified, Kernel Launcher can still be used but it is not possible to capture kernel launches.
8#pragma kernel tuning_key("vector_add_" + T)
The tuning_key
directive specifies the tuning key, which can be a concatenation of strings or variables.
In this example, the tuning key is "vector_add_" + T
, where T
is the name of the type.
Host Code
The code below shows how to call the kernel from the host in C++:
#include "kernel_launcher/pragma.h"
using namespace kl = kernel_launcher;
void launch_vector_add(float* C, const float* A, const float* B) {
kl::launch(
kl::PragmaKernel("vector_add_annotated.cu", "vector_add", {"float"}),
n, C, A, B
);
);
The PragmaKernel
class implements the IKernelDescriptor
interface, as described in Kernel Registry.
This class reads the specified file, extracts the Kernel Launcher pragmas from the source code, and compiles the kernel.
The launch
function launches the kernel and, as discussed in Kernel Registry, it uses the default registry to cache kernel compilations.
This means that the kernel is only compiled once, even if the same kernel is called from different locations in the program.
List of pragmas
The table below lists the valid directives.
Directive |
Description |
|
Add a new tunable variable. |
|
Add a new variable. |
|
Specify the size of buffer arguments. This directive may occur multiple times. |
|
Specify the tuning key used to search for the corresponding wisdom file. |
|
An N-dimensional vector that indicates workload size. |
|
An N-dimensional vector that indicates the CUDA grid size. |
|
An N-dimensional vector that indicates the CUDA thread block size. |
|
Alternative way of specifying the grid size. The problem size is divided by the grid divisors to obtain the grid dimensions. |
|
Boolean expression that must evaluate to |