CS40 Lab 7: CUDA

Due 11:59pm Wednesday, 7 November 2018

For this lab, you'll extend, write, and test some CUDA kernels.

Initializing git repos

You may work with one partner on this lab. We will use Teammaker for creating teams. Clone your personal copy of the starter code for lab 07 as shown below.
$ cd
[~]$ ssh-add
Enter passphrase for /home/ghopper1/.ssh/id_rsa:
Identity added: /home/ghopper1/.ssh/id_rsa (/home/ghopper1/.ssh/id_rsa)
[~]$ cd ~/cs40/
[labs]$ git clone git@github.swarthmore.edu:CS40-F18/lab07-YOURUSERNAME-YOURPARTNERNAME.git ./lab07

Making and building code

Make a build directory in your lab07 directory and run cmake with the special path for CUDA, then run make.
[~]$ cd ~/cs40/lab07
[~]$ mkdir build
[~]$ cd build
[build]$ cmake -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-6 \
      -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc ../
[build]$ make -j8
There are two executables for you to modify: maxval and circleArt.
Part 1: Maximum Value
First compile and run maxval.cu. This program is supposed to compute the maximum of an array of floats. Initially a CPU only version has been provided for you. I provided various timing code to time the GPU and CPU versions of this max function. Your first step is to write a simple CUDA kernel that works with only multiple threads and one block. Because a global CUDA kernel can only have a void return type, the variable partial_c can be used to hold a GPU buffer that can store one or more results. Call your kernel max_gpu_thread, and have it store the max value amongst all the threads in the first index of the results buffer. Call your kernel in main with one block and one thread, and note the time. Check that your GPU result matches the CPU result before proceeding.

As noted in the code, you should use shared memory and a parallel reduction within each block. You can use the dot.cu demo as a guide for the setting up shared memory. The parallel reduction we finished on Friday for the dot product is pasted below. You can use a similar pattern to compute your max value.

/* This reduction assumes blockDim is a power of 2
     A reduction can work for any size, but the
     indexing becomes a little more tricky */
int i=blockDim.x/2;
while(i != 0 ){
  if(threadIdx.x < i){
    /*reduce!*/
    cache[threadIdx.x] += cache[threadIdx.x+i];
  }
  __syncthreads();
  /*outside branch but inside loop to prevent deadlock  */
  i/=2;
}

Next, change the size of N in main from 32 to 32*1024*1024. Run your code and note the time for the GPU and CPU versions. If your GPU version is significantly slower, that is OK at this point. Next, make the following changes and run some experiments. Note, the GPU version may be so slow that it times out. If this happens, decrease the size of N until the kernel is able to finish.

Experiments

Thread Kernel

Call your max_gpu_kernel kernel with the following number of threads and record the time: 32, 64, 256, 512. In this version, you should only have one block, so all threads in a block can communicate and determine a single answer for the max of the array. Using shared memory, compute the global answer and write one value to *c. Since the number of threads will vary, you can overprovision the size of the shared memory array to be the maximum number of threads (512 or even 1024)

Combination Kernel

Next write a kernel called max_gpu that can be called on an arbitrary number of blocks, each with multiple threads. Try various block and thread counts when calling your kernel, reporting at least three experiments and highlighting the parameters that result in the shortest run time. At the thread level, you should use shared memory and a parallel reduction to compute the maximum value per block.

Test on two Cards

Test your code on at least two different graphics cards. See the list of Host graphics cards. The specs in terms of multiprocessors, cores per multiprocessor, and total number of cores is summarized below. The 1000M cards at the bottom of this list are very old and may not even run the code samples. Please avoid machines with this card. The M1000M machines are fine.
Card MP Cores/MP Total
P5000201282560
GTX 1080201282560
GTX 980Ti221282816
GTX 7504128512
M1000M4128512
M6004128512
K2100M3192576
1000M264128
Part 2: Circle Art
In this part, you will use CUDA an the QtVis library to render and animate a list of semi transparent circles on the screen. This labs was adapted for the QtVis library from a similar assignment at CMU. You may find some of the reading there helpful, but much of it is particular to their setup.

In this application we assume our view is a static 2D world view bounded between 0 and 1 in the $x$ and $y$ directions. This world view is rendered onto an image containing $w$ columns and $h$ rows of pixels with $(r,c)=(0,0)$ in the upper left and wold coordinates $(x,y)=(0,0)$ in the lower left.

In this world, we have a set of Circles described in the lightweight class circle.h. Each circle has a center position, and radius, a color, along with a velocity (vx,vy,vz) and transparency between 0 and 1, 0 being completely transparent, 1 being opaque.

The functions in scenes.h can generate a set of circles in this world and return them as a CircScene struct containing a pointer to an array of circle objects and a count of the total number of objects. Because CUDA cannot easily compile complex C++ code including Qt5 code, we don't have the luxury of using QVector3D or similar QList objects to store our information.

In viewer.cpp we create a QTViewer application like several examples in class. We additionally make a scene and pass the resulting circles to a new Animator subclass you will create called CircleRenderer. Your primary task is to complete several CUDA kernels that will draw the set of circles into a buffer, and also animate the scene.

Understanding the Renderer

Every CUDA Animator in the QTViewer framework must implement a update(ImageBuffer *img) method that writes colors to an array of pixels stored in img->buffer. You can see examples of how this is done in several of the in class examples as well as the clearImage kernel which has been completed for you. The update method gets called repeatedly in an openGL event loop and, if you draw a new image each time update is called, you will see an animation.

For this example, the primary update loop will be

clear the entire screen
move the circles according to their velocity
draw the circles in the image buffer
Right now, only clear is implemented. It takes a clear color as input and writes that color to every pixel in the image buffer. You should examine the implementation of clearImage in clearImage.cu to understand how it works and how to work with the ImageBuffer.

drawCircsWrong has an implementation that attempts to draw the circles using CUDA, but it isn't quite right. It works OK if you only have a single thread, but that isn't very parallel or scalable. drawCircsWrong parallelizes the problem the distributing all the circles over all the threads. For each circle, a thread computes a bounding box around the circle, and finds the rows and columns of the image contained within that box. For each pixel in the box, it checks if the center of the pixel is inside the circle. If the pixel overlaps, the old color in the image buffer and the color of the circle are blended together according to the blending function provided. You should not modify the blending function. However, you should note that the order of blending the circles matters and can lead to different results. Your rendering must preserve the order of the circles as they are presented in the initial input array. drawCircsWrong may violate this property and the Readme asks you to think about why.

The correct order for the three test circles is to draw the red circle first, followed by the green then blue.

Two possible incorrect orders are blue, then green, then red

or green, then red, then blue.

You need to find a way to properly implement drawCircs. Implement your code in drawCircs, keeping drawCircsWrong unmodified. To test your solution, don't forget to change the call in update in circRenderer.cu to call your drawCircs kernel.

Once you think you have the drawing correct, implement the updating of your scene by writing one more kernel moveCircs. Note that this kernel does not draw anything, but it simply updates the position of each circle to be the previous position plus the current velocity. Once the positions have been modified, the next call to drawCircs will draw the circles in their new location. If a circle moves completely off one edge of the screen, have it wrap around to the other side as demonstrated in lab.

Adjust numFrames in viewer.cpp to test your animation. Set numFrames=-1 to let it loop until you quit.

Readme
Complete the concept questions and lab survey prior to submitting. As both the QtVis library and the Circle Renderer are new assignments, I'd appreciate any feedback if there are aspects that are unclear or could be improved.
Submit
You should regularly commit your changes to the lab07 repo and occasionally push changes to the github remote. Note you must push to your remote to share updates with your partner. Ideally you should commit changes at the end of every session of working on the project. You will be graded on work that appears in your github repo by the project deadline.