Full CUDA example
This page explains a CUDA program that estimates the value of pi using Kernel Float.
Overview
The program calculates Pi by generating random points within a unit square and counting how many fall inside the unit circle inscribed within that square. The ratio of points inside the circle to the total number of points approximates Pi/4.
The kernel is shown below:
namespace kf = kernel_float;
using float_type = float;
static constexpr int VECTOR_SIZE = 4;
__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
int thread_y = blockIdx.y * blockDim.y + threadIdx.y;
kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
kf::vec<int, VECTOR_SIZE> yi = thread_y;
kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);
kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);
int n = kf::count(dist <= float_type(1));
if (n > 0) atomicAdd(global_count, n);
}
Code Explanation
Let’s go through the code step by step.
// Alias `kernel_float` as `kf`
namespace kf = kernel_float;
This creates an alias for kernel_float
.
// Define the float type and vector size
using float_type = float;
static constexpr int VECTOR_SIZE = 4;
Define float_type
as an alias for float
to make it easy to change precision if needed.
The vector size is set to 4, meaning each thread will process 4 data points.
__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
The CUDA kernel. There are nx
points along the x axis and ny
points along the y axis.
int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
int thread_y = blockIdx.y * blockDim.y + threadIdx.y;
Compute the global x- and y-index of this thread.
kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
kf::vec<int, VECTOR_SIZE> yi = thread_y;
Compute the points that this thread will process.
The x coordinates start at thread_x * VECTOR_SIZE
and then the vector [0, 1, 2, ..., VECTOR_SIZE-1]
.
The y coordinates are all thread_y
.
kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);
Divide xi
and yi
by nx
and ny
to normalize them to [0, 1]
range.
kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
Compute the squared distance from the origin (0, 0) to each point from xf
,yf
.
kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);
Take the element-wise square root.
int n = kf::count(dist <= float_type(1));
Count the number of points in the unit circle (i.e., for which the distance is less than 4).
The expression dist <= 1
returns a vector of booleans and kf::count
counts the number of true
values.
atomicAdd(global_count, n);
Add n
to the global_count
variable.
This must be done using an atomic operation since multiple thread will write this variable simultaneously.