CUDAfication: NVIDIA GPU Parallel Processing for Data

NVIDIA’s GPU (graphics processing unit) technology originated to support the intensive video processing required by increasingly “realistic” computer gaming.  In simplified terms: achieving an immersive quality in a video game requires very rapid transformations of the millions of pixels on the monitor screen.  While the values differ for each pixel, and the transformations applied to a given pixel may vary, the calculation logic is the same for every pixel.  More importantly, the result of the calculation for one pixel is independent of the calculation for any other pixel.  These separate calculations can therefore be done in parallel by thousands of processors, recomputing every pixel at extreme speed for each video frame.  This description is admittedly an oversimplification, but it gives a general idea of the power of parallel processing.


Compare this with the traditional CPU (central processing unit) found in the average desktop computer.  With a CPU, each of the pixel calculations would have to be done sequentially rather than in parallel, and a CPU simply could not handle the processing demands of many modern video games.  The latest CPUs do offer some parallelism through multiple cores (perhaps 24 in high-end models); in contrast, the latest NVIDIA Titan V processor has 5,120 cores.  Even so, most software written for CPUs uses sequential logic, whether or not it could be parallelized in whole or in part.


NVIDIA has for years been at the forefront of GPU development.  As scientists and developers began to anticipate the promise of parallel computing in many other areas besides video games, NVIDIA developed the CUDA platform, first released in 2006, enabling developers to write software in C or C++ for an NVIDIA GPU.


The ability to use parallel processing is not a given.  Some processing problems are called “embarrassingly parallel” – meaning that each data element is treated the same way, and the values in each element are independent of all the others (as described above for video processing).  For example, if you have an array of records, each containing a product SKU and a price, and you want to apply a 5% discount across the board, this is an embarrassingly parallel problem: discounting the price of each record is independent of every other record.  Now suppose each record contains a SKU, a product Category, and a price, and you want to discount the price of all Raincoats by 10% (assume RAINCOAT is a Category code in the record).  This is still an embarrassingly parallel problem, even though only Raincoats will be discounted.  The logic is: if the record’s Category is RAINCOAT, apply the discount to its price; otherwise do nothing.  Note that the order of records in the array is irrelevant.  No sorting is required, and the logic can be applied “simultaneously” to all elements in the array.
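To make this concrete, here is a minimal CUDA sketch of the raincoat discount.  The kernel name, the RAINCOAT category code, and the data sizes are all invented for illustration, and error checking is omitted; the point is that each thread handles one record independently of all the others:

```cuda
#include <cstdio>

// Hypothetical category code for raincoats -- illustration only.
#define RAINCOAT 7

// Each thread processes exactly one array element, independently of
// every other element: a classic embarrassingly parallel kernel.
__global__ void discountRaincoats(const int *category, float *price, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && category[i] == RAINCOAT)
        price[i] *= 0.90f;   // 10% discount, applied in parallel
}

int main()
{
    const int n = 1 << 20;               // one million products
    int   *category;
    float *price;
    cudaMallocManaged(&category, n * sizeof(int));
    cudaMallocManaged(&price,    n * sizeof(float));
    for (int i = 0; i < n; ++i) { category[i] = i % 10; price[i] = 100.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n
    discountRaincoats<<<blocks, threads>>>(category, price, n);
    cudaDeviceSynchronize();

    printf("price[7] = %.2f\n", price[7]);  // a raincoat: discounted to 90.00
    cudaFree(category);
    cudaFree(price);
    return 0;
}
```

Note that nothing in the kernel depends on the order of the records, which is exactly what makes the problem embarrassingly parallel.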


However, many processing problems are not embarrassingly parallel.  For example, to calculate the average price of all SKUs, or of all Raincoats, you normally go through the elements one by one, keep a running total, and then divide that total by the number of elements found.  The result at each element depends on the cumulative result from all the prior elements.  Written this way, the calculation cannot be done in parallel; it requires sequential logic.


One of the major challenges in CUDA programming is how to handle problems that are not embarrassingly parallel.  This requires some clever tricks and techniques.  A problem may be broken down into pieces, where parallelism can be applied, and the pieces are then reassembled in various ways.  Interestingly, some of the great CUDA programmers, typically those who write CUDA libraries, have been able to provide functionality that is easily 20 to 100 times faster than the benchmark CPU programs, even though the logic would seem to be fundamentally sequential.
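For instance, the running-total problem above is usually attacked by breaking it into pieces with a tree reduction: each block sums its own tile of the input in parallel, and the per-block partial sums are then combined in a second pass (on the host, or with another kernel).  The sketch below shows only the in-block step; it is not the author’s code, and the launch configuration and second pass are omitted:

```cuda
// Each block reduces its tile of the input in shared memory, halving the
// number of active threads at every step: log2(blockDim.x) steps instead
// of one long sequential pass.  Assumes blockDim.x is a power of two.
// Launch as: partialSums<<<blocks, threads, threads * sizeof(float)>>>(...)
__global__ void partialSums(const float *in, float *blockSums, int n)
{
    extern __shared__ float tile[];   // sized at launch time
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;  // pad the last tile with zeros
    __syncthreads();

    // Pairwise tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = tile[0];  // one partial sum per block
}
```

The sequential dependency has not disappeared, but it has been shrunk from n steps to a logarithmic number of parallel steps, which is where the large speedups come from.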


NVIDIA GPUs have become the premier platform for:


  • Supercomputing.  A present-day supercomputer may comprise thousands of GPUs.
  • Advanced scientific research and programming.  Specialty libraries facilitate research with previously unheard-of speeds.
  • Artificial Intelligence (AI).  GPUs have facilitated Neural Networks and Deep Learning with unprecedented performance.  The uses for AI are growing daily.
  • Blockchain.  GPUs can greatly accelerate the intensive calculations required for blockchain – recently leading to a doubling or tripling of prices on many GPU boards.

This summary doesn’t begin to do justice to the uses and applications of NVIDIA GPU technology.  NVIDIA AI, for example, is not only at the heart of self-driving car technology; it is also being used in many new scientific applications, and has recently been used to greatly accelerate the search for new planets.


I want to highlight one area that gets very little attention, where GPU parallel processing can deliver huge performance boosts (typically a 100-times speed improvement or better): data processing and manipulation.  I have had the opportunity to work on several such projects.  Data processing in CUDA isn’t a simple given; it requires specialized techniques custom-designed to fit specific challenges.  Starting with the CUDA libraries ModernGPU and CUB, which provide essential core functionality, the developer can design and build custom algorithms for data.  I have also developed some useful extensions to these libraries, such as multi-criteria sorting, which are particularly valuable when working with data.  Data consolidation/compression, parsing text such as UTF-8 CSV into arrays, running queries on unsorted data, and running multiple queries simultaneously are among the types of processing that can be done – and all of these processes can run virtually instantaneously.  Each scenario is different and may require a creative design to solve.  But if speed and performance are crucial, CUDA may be the best solution, and perhaps the only viable one.
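As one small taste of the core functionality those libraries provide, CUB’s DeviceReduce performs a GPU-wide sum with a two-phase call: the first call only computes the size of the scratch buffer, the second runs the reduction.  The wrapper function below is my own sketch, with error checking omitted:

```cuda
#include <cub/cub.cuh>

// Sum num_items floats already resident on the device, using CUB's
// two-phase API.  Returns the total to the host.
float deviceSum(const float *d_in, int num_items)
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));

    // Phase 1: pass a null buffer; CUB reports the scratch size it needs.
    void  *d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);

    // Phase 2: allocate the scratch buffer and run the actual reduction.
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_out);
    return result;
}
```

Primitives like this – reductions, scans, sorts – are the building blocks from which the custom data-processing algorithms described above are assembled.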


Copyright © 2018 Patrick D. Russell