What's Up With the GPU?

Thu May 17, 2012

In my previous post I discussed the various multiprocessing technologies at my disposal: OpenMP, MPI, PPL (Microsoft Concurrency Runtime), the ACIS thread manager, and the CGM multiprocessing infrastructure. As it turns out, I overlooked a highly relevant technology, the GPU. I would like to correct this oversight and pass along my experiences in adapting the primality algorithm from the previous discussion to run on a GPU. (Look here for a brief introduction to GPU computing)

I’m working remotely at the moment and decided to avoid the potential headaches of accessing a video card through a remote desktop connection by simply using my laptop for this exercise. It does, after all, have a decent video card: an NVIDIA Quadro 1000M with 96 cores and 2GB of RAM. After evaluating the development tools compatible with my video card, I chose CUDA, a parallel computing platform and programming model invented by NVIDIA. (Look here for a brief introduction to CUDA)

I downloaded and installed the latest CUDA Toolkit (version 4.1), the associated video driver, and example code from NVIDIA. Then I went through the documentation and a few sample programs to get a feel for the task at hand. I quickly identified several aspects of GPU programming that were different from what I was used to, namely how jobs are presented to the GPU, how they are broken down into tasks to fit the available hardware, and how the tasks are computed.

Jobs are computed on the GPU using what is called a kernel launch. A job is typically made up of many tasks. In our example, the job is to “find all prime numbers between one and one hundred million”. This can be broken down into one hundred million tasks, each of which calculates the primality of a single number. The kernel is essentially a function with arguments that gets called once per task. It determines its index through variables made available by the CUDA architecture (discussed later), performs the operation on the input data at that index in the input array, and stores the result at the same index in the output array.
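In CUDA C, a kernel is an ordinary-looking function marked with __global__, and the launch itself uses the triple angle bracket syntax to specify how many blocks and how many threads per block to run. A bare-bones sketch (the names are illustrative, and the device pointers are assumed to have been allocated already, as discussed below):

// The kernel: one invocation per task, executed on the device.
__global__ void test_primality(const unsigned long long *input, bool *output)
{
    // Each thread works on the element at its own index
    // (how that index is computed is covered below).
}

// The launch: run the kernel on 'blocks' blocks of 'threadsPerBlock' threads each.
void launch(const unsigned long long *d_input, bool *d_output, int blocks, int threadsPerBlock)
{
    test_primality<<<blocks, threadsPerBlock>>>(d_input, d_output);
}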

This is a classic Single-Instruction-Multiple-Data (SIMD) processing architecture, where each processor is mapped to a unique index and all processors execute the exact same set of instructions, at the same time, each on its own input data. In our example, the input data is an array containing all the numbers between one and one hundred million. We hand this array to the GPU as an argument to the kernel function, and it calculates the primality of each element in the array, overwriting the corresponding array entry with the result of the computation, in this case either true or false.

Calculating the unique index in each task is not as straightforward as you might think. In CUDA, a job is broken down into a multidimensional grid of blocks, where the number of tasks is equal to the number of blocks in the grid times the number of threads in a block. These values, the number of blocks and the number of threads per block, are specified when the kernel function is launched. Corresponding variables are available in the kernel function to compute the current index based on these values.

Calculating an index might look something like this:

index = blockIdx.x * blockDim.x + threadIdx.x;

Where blockIdx.x is the current block index within the grid, blockDim.x is the number of threads in the block, and threadIdx.x is the current thread index within the block. To make things a bit more flexible (and complex), the grid size and block size can be specified in multiple dimensions, which is why these variables have .x, .y, and .z components. The extra dimensions are necessary to overcome limitations that would otherwise severely restrict the number of tasks that can be performed in any single kernel launch.

For practical purposes there is an upper limit to the number of threads available at any given time. In CUDA this translates into a maximum number of threads per block, which on my system is 1024. Because of this, we would need roughly one hundred thousand blocks to accommodate the target number of inputs in a single launch. Unfortunately, there is also an upper limit on the number of blocks in each grid dimension, which on my system is 65535. To overcome this limitation, CUDA provides a multi-dimensional grid of blocks. My system supports a grid of 65535 x 65535 x 65535 blocks, each with 1024 threads. That’s a very large number of indices available for any one operation.
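For example (this illustrates the mechanism, not necessarily how my code sizes its grid), a two-dimensional grid of one-dimensional blocks can be flattened back into a single linear index inside the kernel, with a guard for the padded tail:

__global__ void kernel_2d(bool *output, unsigned long long n)
{
    // Flatten (blockIdx.y, blockIdx.x, threadIdx.x) into one linear index.
    unsigned long long block = (unsigned long long)blockIdx.y * gridDim.x + blockIdx.x;
    unsigned long long index = block * blockDim.x + threadIdx.x;
    if (index >= n)
        return;                 // extra threads in the padded grid do nothing
    // ... operate on element 'index' of 'output' ...
}

void launch_2d(bool *d_output, unsigned long long n)
{
    const int threadsPerBlock = 1024;
    unsigned long long blocksNeeded = (n + threadsPerBlock - 1) / threadsPerBlock;
    unsigned int gridX = (blocksNeeded < 65535ULL) ? (unsigned int)blocksNeeded : 65535u;
    dim3 grid(gridX, (unsigned int)((blocksNeeded + gridX - 1) / gridX));
    kernel_2d<<<grid, threadsPerBlock>>>(d_output, n);
}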

Next come the input and output arrays, which in our example are one and the same. Allocating the array of input values is simple in native code (referred to as the host in CUDA speak), and as it turns out it is simple with CUDA as well. It’s accomplished with cudaMalloc and cudaFree, which allocate and free memory on the device, respectively. The typical approach is to allocate identical arrays on both the host and the device and use cudaMemcpy to transfer the contents back and forth.
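With those three calls, the round trip looks roughly like this (error checking omitted, and the element type is just for illustration):

#include <cuda_runtime.h>

void round_trip(bool *hostArray, size_t count)
{
    size_t bytes = count * sizeof(bool);

    bool *deviceArray = 0;
    cudaMalloc((void **)&deviceArray, bytes);                           // allocate on the device

    cudaMemcpy(deviceArray, hostArray, bytes, cudaMemcpyHostToDevice);  // host -> device

    // ... kernel launches operate on deviceArray here ...

    cudaMemcpy(hostArray, deviceArray, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(deviceArray);                                              // release device memory
}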

Here is where my respect for resources causes me to deviate from my original plan. Since even numbers (except 2) are never prime, we can cut the array size in half and only consider odd numbers. This changes things quite a bit, however, from the size of the input/output array, to the adjustment of the index value, to the number of operations that are actually performed. Nonetheless, I am willing to extend this courtesy to the GPU for optimization’s sake, even though it’s cheating a bit.
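If index zero is mapped to the first odd number, for example, the kernel can recover the value it should test with a little arithmetic (my own illustration; the actual code may map the indices differently):

candidate = 2 * index + 1;   // index 0 -> 1, index 1 -> 3, index 2 -> 5, ...

The number 2, the only even prime, can then be accounted for separately on the host.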

So now I have a program that allocates an array on both the host and the device, initializes the host copy, copies it to the device, launches a kernel with appropriate grid and block dimensions, copies the array data back to the host, validates the results, and finally frees the memory. The kernel function calculates its index, loads the input value, tests it for primality, and writes the result back to the array.

When testing the code I naturally began with smaller ranges of inputs, to make sure everything was working as expected. I got a big surprise when I finally ran with the target range of one hundred million: the program failed! As it turns out, the Windows operating system terminates operations on the video device that take longer than a few seconds. This is a known behavior called Timeout Detection and Recovery (TDR). I could have fiddled with registry settings to disable TDR, but instead decided to simply restructure the program to use multiple kernel launches.

From experimentation I found it safe to process one million inputs at a time. So all I had to do was launch the kernel from within a loop, passing the current iteration to the kernel function and adjusting the index accordingly. That’s it. I now have a complete and working program.
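A rough sketch of the overall structure (this is my own illustration using trial division, with illustrative names and sizes and no error handling, not the actual code in the primes1/primes2 listings below):

#include <cuda_runtime.h>
#include <cstdio>

// Trial-division primality test, run on the device by each thread.
__device__ bool is_prime(unsigned long long n)
{
    if (n < 2)
        return false;                        // 0 and 1 are not prime
    if (n % 2 == 0)
        return n == 2;                       // 2 is the only even prime
    for (unsigned long long d = 3; d * d <= n; d += 2)
        if (n % d == 0)
            return false;
    return true;
}

// Each thread tests one odd candidate (2 * index + 1) and records the result.
__global__ void prime_kernel(bool *results, unsigned long long chunkStart,
                             unsigned long long chunkEnd)
{
    unsigned long long index = chunkStart + blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= chunkEnd)
        return;                              // guard against the padded tail of the chunk
    results[index] = is_prime(2 * index + 1);
}

int main()
{
    const unsigned long long limit = 100000000ULL;      // find primes between 1 and 100,000,000
    const unsigned long long candidates = limit / 2;    // odd numbers only: 1, 3, 5, ...
    const unsigned long long chunkSize = 1000000ULL;    // about one million per launch (avoids TDR)
    const int threadsPerBlock = 1024;

    bool *d_results = 0;
    cudaMalloc((void **)&d_results, candidates * sizeof(bool));

    // Launch the kernel repeatedly, one chunk of candidates at a time.
    for (unsigned long long start = 0; start < candidates; start += chunkSize)
    {
        unsigned long long end = (start + chunkSize < candidates) ? start + chunkSize : candidates;
        int blocks = (int)((end - start + threadsPerBlock - 1) / threadsPerBlock);
        prime_kernel<<<blocks, threadsPerBlock>>>(d_results, start, end);
        cudaDeviceSynchronize();             // wait for this chunk before launching the next
    }

    // Copy the results back to the host and count the primes (2 is added separately).
    bool *h_results = new bool[candidates];
    cudaMemcpy(h_results, d_results, candidates * sizeof(bool), cudaMemcpyDeviceToHost);

    unsigned long long primeCount = 1;       // the even prime, 2
    for (unsigned long long i = 0; i < candidates; ++i)
        if (h_results[i])
            ++primeCount;
    printf("primes found: %llu\n", primeCount);

    delete[] h_results;
    cudaFree(d_results);
    return 0;
}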

View Main Function (primes1)

View Kernel Function (primes2)
The performance was the next big surprise. The serial run on my laptop took 109 seconds; using OpenMP and four processors it dropped to 27 seconds; and in comparison the GPU run took 53 seconds. This was unexpected at first, given that we have 96 processors available, until I realized that the tasks are very unbalanced. Determining primality is quick in some cases and can take a long time in others, especially when the number is large and prime.

The drawback of tasks with varying complexity on the GPU is that each operation, with whatever chunk of inputs is ultimately scheduled, takes as long as its most complex task. My video card has 96 processors, which makes me suspect that the inputs are processed in chunks of 96. If calculating primality is mostly simple, then most processors will sit idle while a few work on the complex cases. In contrast, the independent processors in the CPU versions are seldom idle because they can simply move on to the next task.

As an experiment, I wanted to test the capabilities of my GPU with a more balanced task complexity. So I picked a very large prime number to use for every task and modified the job to compute the same task one million times. The serial operation took 22.2 seconds, the OpenMP version took 5.6 seconds, and the GPU version took 1.5 seconds. Now that is the result I was hoping to see.
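The modified kernel might look something like this (again just a sketch; the particular prime and the launch counts are illustrative):

// Every thread tests the same candidate, so every task performs the same amount of work.
__global__ void balanced_kernel(bool *results, unsigned long long candidate, unsigned long long count)
{
    unsigned long long index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < count)
        results[index] = is_prime(candidate);   // is_prime as in the earlier sketch
}

// For one million identical tasks:
// balanced_kernel<<<977, 1024>>>(d_results, someLargePrime, 1000000ULL);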

The GPU certainly has a place in the multiprocessing arena, but I think it can be very challenging to find applicable operations. For many years now we have been analyzing the performance of ACIS, and to date we have never found a situation that would be best served by the GPU. I do think, however, that many applications exist that can benefit greatly from utilizing the GPU. I’m sure it’s the right tool for some problems.

I would like to hear if anyone has found a good use for the GPU . . .?
