# Full CUDA example

This page explains a CUDA program that estimates the value of pi using Kernel Float.


## Overview

The program calculates Pi by generating random points within a unit square and counting how many fall inside the unit circle inscribed within that square. The ratio of points inside the circle to the total number of points approximates Pi/4.

The kernel is shown below:


```c++
namespace kf = kernel_float;

using float_type = float;
static constexpr int VECTOR_SIZE = 4;

__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
    int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
    int thread_y = blockIdx.y * blockDim.y + threadIdx.y;

    kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
    kf::vec<int, VECTOR_SIZE> yi = thread_y;

    kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
    kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);

    kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
    kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);

    int n = kf::count(dist <= float_type(1));

    if (n > 0) atomicAdd(global_count, n);
}
```


## Code Explanation

Let's go through the code step by step.

```cpp
// Alias `kernel_float` as `kf`
namespace kf = kernel_float;
```

This creates an alias for `kernel_float`.

```cpp
// Define the float type and vector size
using float_type = float;
static constexpr int VECTOR_SIZE = 4;
```

Define `float_type` as an alias for `float` to make it easy to change precision if needed.
The vector size is set to 4, meaning each thread will process 4 data points.

```cpp
__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
```

The CUDA kernel. There are `nx` points along the x axis and `ny` points along the y axis.

```cpp
    int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
    int thread_y = blockIdx.y * blockDim.y + threadIdx.y;
```

Compute the global x- and y-index of this thread.

```cpp
    kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
    kf::vec<int, VECTOR_SIZE> yi = thread_y;
```

Compute the points that this thread will process.
The x coordinates start at `thread_x * VECTOR_SIZE` and then the vector `[0, 1, 2, ..., VECTOR_SIZE-1]`.
The y coordinates are all `thread_y`.

```cpp
    kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
    kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);
```

Divide `xi` and `yi` by `nx` and `ny` to normalize them to `[0, 1]` range.

```cpp
    kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
```

Compute the squared distance from the origin (0, 0) to each point from `xf`,`yf`.

```cpp
    kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);
```

Take the element-wise square root.

```cpp
    int n = kf::count(dist <= float_type(1));
```

Count the number of points in the unit circle (i.e., for which the distance is less than 4).
The expression `dist <= 1` returns a vector of booleans and `kf::count` counts the number of `true` values.

```cpp
atomicAdd(global_count, n);
```

Add `n` to the `global_count` variable.
This must be done using an atomic operation since multiple thread will write this variable simultaneously.