Week 8: CUDA

Code

You can get sample CUDA code from github:

git clone git@github.swarthmore.edu:CS40-F25/cuda.git

Finding CUDA

To help CMake find CUDA, we will adjust some paths in your shell environment. By default, student shells are set to bash, and a special file ~/.bashrc sets some environment variables when you log in. We will add a few lines to this file so CMake can find CUDA. Open your ~/.bashrc file in your favorite text editor and add the following lines to the end of the file:

export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/:/usr/local/cuda/lib64

If you want the changes to take effect immediately, you can run the following command in your terminal:

source ~/.bashrc

In general, the next time you log in, these changes will take effect automatically without needing to run the source command again.

After making the changes, you may need to remove your build directory and re-run CMake to ensure it picks up the new environment variables.

cd ~/cs40/cuda/
ls  # check for build/ directory
rm -rf build/  # remove build/ directory if it exists
mkdir build
cd build
cmake ..
make -j8

Examples

We’ll look at the examples in the following order:

  • hello_world.cu

  • simple_kernel.cu

  • simple_kernel_params.cu

  • simple_device_call.cu

  • add_loop_cpu.cpp

  • add_loop_gpu.cu

  • add_loop_long.cu

  • enum_gpu.cu

  • deviceQuery (installed on system)

CUDA Overview

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. The ideas behind CUDA and General Purpose GPU (GPGPU) programming underlie many recent AI advances. Note that CUDA is specific to NVIDIA GPUs, but all of our CS clients have NVIDIA GPUs.

Let’s take a brief look at CUDA and how it works.

Graphics and GPGPU applications tend to have the following similarities:

  • Both are highly parallel workloads. Graphics workloads often involve processing millions of pixels or vertices in parallel. GPGPU applications often involve large datasets that can be processed in parallel.

  • Vector and matrix operations are common in both graphics and GPGPU applications.

  • The general workflow is:

    1. Allocate memory on the GPU

    2. Transfer data from CPU to GPU

    3. Run code in parallel on the GPU

    4. Transfer results back from GPU to CPU

    5. Free GPU memory

How do we complete these steps in OpenGL?

In CUDA, we use cudaMalloc and cudaFree to allocate and free memory on the GPU. We use cudaMemcpy to transfer data between the CPU and GPU. To run code in parallel on the GPU, we define functions called kernels that are executed by multiple blocks/threads in parallel. Each block/thread has its own unique identifier that can be used to determine which part of the data it should process.

CUDA extends the C language with some new keywords and syntax to define kernels and manage memory.
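
Putting the five workflow steps together, a minimal CUDA program might look like the sketch below. This is not one of the sample programs; the array-doubling kernel and all names here are made up for illustration, and the __global__ keyword is explained in the next section.

#include <stdio.h>

#define N 16

// Hypothetical kernel: each thread doubles one element of the array.
__global__ void double_elements(int *data) {
   int i = threadIdx.x;
   if (i < N) {
      data[i] *= 2;
   }
}

int main(void) {
   int host_data[N];
   for (int i = 0; i < N; i++) host_data[i] = i;

   int *dev_data;
   cudaMalloc((void**)&dev_data, N * sizeof(int));      // 1. allocate GPU memory
   cudaMemcpy(dev_data, host_data, N * sizeof(int),
              cudaMemcpyHostToDevice);                   // 2. transfer data CPU -> GPU

   double_elements<<<1, N>>>(dev_data);                  // 3. run the kernel in parallel

   cudaMemcpy(host_data, dev_data, N * sizeof(int),
              cudaMemcpyDeviceToHost);                   // 4. transfer results GPU -> CPU
   cudaFree(dev_data);                                   // 5. free GPU memory

   for (int i = 0; i < N; i++) printf("%d ", host_data[i]);
   printf("\n");
   return 0;
}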

__global__ kernels

__global__ is a CUDA keyword used to declare a function as a kernel that (1) is launched from the CPU and (2) runs in parallel on the GPU. All __global__ functions must have a void return type.

Launching kernels

To call a CUDA kernel from the CPU, we specify the name of the kernel along with the launch configuration inside the special <<<gridDim, blockDim>>> brackets, which describe how much parallelism we want. In the example below, we are asking for one block in the grid and one thread per block. Each thread executes the kernel function independently.

#include <stdio.h>

// An empty kernel that runs on the GPU.
__global__ void kernel(void) {}

int main(void) {
   kernel<<<1, 1>>>();  // launch the kernel with 1 block of 1 thread
   printf("Hello\n");   // printed by the CPU
   return 0;
}

__device__ and __host__ modifiers

The __device__ keyword is used to declare a function or variable that lives on the GPU. A __device__ function can only be called from other GPU code (i.e., from within a __global__ kernel or another __device__ function).

The __host__ keyword is used to declare a function that runs on the CPU and can only be called from other CPU code. By default, functions are considered __host__ if no modifier is specified. Often, you will see functions declared with both __host__ and __device__ modifiers to indicate that they can be called from both CPU and GPU code.
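
For example, a kernel might use small helper functions like these (hypothetical helpers, not from the sample code):

// Callable only from GPU code (a kernel or another __device__ function).
__device__ float square(float x) {
   return x * x;
}

// Callable from both CPU code and GPU code.
__host__ __device__ float halve(float x) {
   return 0.5f * x;
}

__global__ void scale(float *data, int n) {
   int i = threadIdx.x;
   if (i < n) {
      data[i] = halve(square(data[i]));  // GPU code calling the helpers
   }
}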

The compute grid: blocks and threads

CUDA organizes parallel execution using a grid of blocks, where each block contains multiple threads. Each thread executes the kernel function independently, allowing for massive parallelism.

When launching a kernel, you specify the number of blocks and the number of threads per block using the <<<gridDim, blockDim>>> syntax. For example, <<<2, 4>>> would launch a grid with 2 blocks, each containing 4 threads. In this case, there are a total of 8 independent threads executing the kernel function.

The grid and each block can be up to 3-dimensional, which makes it easier to map threads onto 2D and 3D data. The dim3 type is used to specify the dimensions of blocks and grids and has components x, y, and z. You can use dim3 types on both the CPU and GPU sides.

  • dim3: A 3D vector type used to specify dimensions of blocks and grids.

  • gridDim (dim3): A built-in variable that contains the dimensions of the grid (in blocks).

  • blockDim (dim3): A built-in variable that contains the dimensions of the block (in threads).

  • blockIdx (dim3): A built-in variable that contains the index of the current block within the grid

  • threadIdx (dim3): A built-in variable that contains the index of the current thread within the block

The variables gridDim, blockDim, blockIdx, and threadIdx are available within __global__ kernel functions or __device__ functions to help identify the specific thread and block executing the code.
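
For example, the sketch below (hypothetical, not one of the sample programs) launches 2 blocks of 4 threads and has each thread compute and print a unique global index from these built-in variables:

#include <stdio.h>

__global__ void whoami(void) {
   // Which block we are in, times threads per block, plus our index within the block.
   int id = blockIdx.x * blockDim.x + threadIdx.x;
   printf("block %d, thread %d, global id %d of %d\n",
          blockIdx.x, threadIdx.x, id, gridDim.x * blockDim.x);
}

int main(void) {
   whoami<<<2, 4>>>();       // 2 blocks x 4 threads = 8 threads total
   cudaDeviceSynchronize();  // wait for the GPU so the output appears before exit
   return 0;
}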

You can run deviceQuery to see the maximum number of threads per block and other properties of the GPU. In CUDA, each block is assigned to one of the GPU's streaming multiprocessors (SMs); deviceQuery also reports how many multiprocessors your GPU has. The threads within a block are executed in groups of 32 called warps, and the GPU schedules warps for execution on the SMs. If a block has more than 32 threads, multiple warps are created to cover all the threads in the block.

The number of threads per block is limited by the GPU architecture, typically 1024 threads per block on modern GPUs. The number of threads that can execute concurrently on a single multiprocessor is also limited; in the deviceQuery output, this is given by the CUDA Cores per MP value.
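
You can also query these limits from a program with cudaGetDeviceProperties. The sketch below prints a few of the fields (deviceQuery and enum_gpu.cu report many more):

#include <stdio.h>

int main(void) {
   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);  // properties of GPU device 0
   printf("Name:                   %s\n", prop.name);
   printf("Multiprocessors (SMs):  %d\n", prop.multiProcessorCount);
   printf("Warp size:              %d\n", prop.warpSize);
   printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
   printf("Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
   return 0;
}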

You can launch kernels with more blocks than there are multiprocessors, or with more threads than there are CUDA cores. The GPU will schedule the blocks and threads for execution as resources become available, though tuning your grid and block dimensions to a specific GPU architecture can improve performance.

Adding two arrays in parallel

Let’s look at a simple example of adding two arrays in parallel using CUDA. Each thread will be responsible for adding one element from each array and storing the result in a third array.

First look at add_loop_cpu.cpp, which performs the addition on the CPU using a simple for loop. Then look at add_loop_gpu.cu, which performs the same operation using CUDA on a small array. Finally, look at add_loop_long.cu, which extends the CUDA example to handle larger arrays by launching multiple blocks and threads.
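
In add_loop_gpu.cu, each array element is handled by its own block, so the kernel follows a pattern roughly like this sketch (the names in the actual file may differ):

__global__ void add(int *a, int *b, int *c) {
   int tid = blockIdx.x;  // this block handles element tid
   if (tid < N) {
      c[tid] = a[tid] + b[tid];
   }
}

// launched from main() with one block per element and one thread per block:
//   add<<<N, 1>>>(dev_a, dev_b, dev_c);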

In the add_loop_long.cu example, we use 128 blocks to start. Each block will have to process multiple elements of the array since our array size is over a million elements. Note the while loop in the kernel that allows each thread to process multiple elements of the array.
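
The block-stride pattern that loop uses looks roughly like this (N stands for the array length; see add_loop_long.cu for the exact code):

__global__ void add(int *a, int *b, int *c) {
   int tid = blockIdx.x;  // start at this block's index
   while (tid < N) {
      c[tid] = a[tid] + b[tid];
      tid += gridDim.x;   // jump ahead by the total number of blocks (128)
   }
}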

Let’s time the block-level parallelism on the GPU versus the single-threaded CPU version. How could you change this to use thread-level parallelism? How can you change the kernel to use both block- and thread-level parallelism?