In recent years graphics cards have become more and more powerful, eventually reaching computing speeds on the order of hundreds of gigaflops. Graphics Processing Units, or GPUs, are dedicated processors, cheap enough to be widely affordable and powerful enough to replace clusters of tens of modern computers. YARP now has two new modules, which come with the main distribution (starting from the latest releases): CUDA and NVIDIA. This documentation covers only CUDA, even though it is compatible only with the latest Nvidia chipsets. The other driver is slower and less optimized. These new modules rely on the new IGPUInterface class, which can be found among libYARP_dev's includes.
Programming with the CUDA module is quite easy, but there are some requirements. Let's start with an example. The full code of this example can be found under the example/cuda directory. First of all, make sure to include the following headers in your application:
Then you can proceed by creating an instance of the driver object:
Now you have a working CUDA driver set up and configured. However, before you can use it, you have to load the programs you want to execute on your data. Programs are in cubin format, a kind of ASCII executable that the GPU can understand. We will see later on how to compile custom programs. To load one, just do:
Then you can finally execute it. Data is processed either as C arrays of elements or as YARP images. The CUDA module handles both through two overloads of IGPUInterface::execute, one taking "unsigned char *" and one taking "ImageOf<PixelRgb>". Support for floats might come soon.
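To make the "unsigned char *" case more concrete, here is a minimal sketch of how RGB pixels might be laid out in such an array. It assumes a tightly packed, row-major, interleaved buffer with no padding between rows, which may not match how a given image type stores its data. The Rgb struct and the channelOffset and packRgb helpers are hypothetical names introduced for illustration only; they are not part of YARP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of YARP): one 8-bit RGB pixel.
struct Rgb {
    unsigned char r, g, b;
};

// Offset of channel c (0 = R, 1 = G, 2 = B) of pixel (x, y) in a tightly
// packed, row-major, interleaved buffer. Assumes no padding between rows.
inline std::size_t channelOffset(std::size_t x, std::size_t y,
                                 std::size_t width, std::size_t c) {
    assert(x < width && c < 3);
    return (y * width + x) * 3 + c;
}

// Flatten pixels into the kind of "unsigned char" array that a GPU
// program could walk through linearly.
inline std::vector<unsigned char> packRgb(const std::vector<Rgb>& pixels) {
    std::vector<unsigned char> out;
    out.reserve(pixels.size() * 3);
    for (const Rgb& p : pixels) {
        out.push_back(p.r);
        out.push_back(p.g);
        out.push_back(p.b);
    }
    return out;
}
```

With this layout a buffer for a w x h image holds w * h * 3 elements, and a pointer to its first element is what would be handed over as the "unsigned char *" argument.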
At the end of the program, just put:
Now you have a YARP application, but you need some programs to run on the GPU. The structure of a program always has to be the same (for more information, see the CUDA programming manual). The following is an example program (it converts RGB images to BW):
Let's look at it. Just a quick note: the software automatically defines some objects, namely threadIdx, blockIdx, blockDim and gridDim. At the moment, for ease of use, they are used only in 1D, accessed as threadIdx.x, blockIdx.x, blockDim.x and gridDim.x.
threadIdx.x is the id of the thread which executes the main function (within its current block).
blockIdx.x is the id of the block within the block grid.
blockDim is the number of threads contained in a single block, and
gridDim is the size of the block grid.
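As an illustration of how these four values are usually combined, the following sketch computes a grid-wide thread id and the total thread count using the standard CUDA 1D indexing formula. It is plain C++ rather than GPU code: the LaunchIndex struct is a hypothetical stand-in for the built-in variables, which the GPU provides automatically, and the function names are made up for this example.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's built-in variables, restricted to the
// x component; on the GPU these are provided automatically.
struct LaunchIndex {
    unsigned threadIdx_x;  // thread id within its block
    unsigned blockIdx_x;   // block id within the grid
    unsigned blockDim_x;   // threads per block
    unsigned gridDim_x;    // blocks in the grid
};

// Unique id of a thread across the whole grid.
inline unsigned globalThreadId(const LaunchIndex& i) {
    assert(i.threadIdx_x < i.blockDim_x && i.blockIdx_x < i.gridDim_x);
    return i.blockIdx_x * i.blockDim_x + i.threadIdx_x;
}

// Total number of threads running the program.
inline unsigned totalThreads(const LaunchIndex& i) {
    return i.gridDim_x * i.blockDim_x;
}
```

For example, with 4 blocks of 64 threads each, thread 5 of block 2 gets the global id 2 * 64 + 5 = 133 out of 256 threads.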
Then, the main function (at the moment, just use a single main function) has to be named FragmentProgram and has to have "__global__" before its return type. In CUDA a "__global__" function must return void, so always declare it as void. The FragmentProgram function will run on every GPU core, executing the same instructions (SIMD approach), but every thread will have its own threadIdx.x identifier. Finally, we have to compile the GPU program into the binary
Using this information, and knowing the number of threads currently running, you can write parallel code to run on the GPU, analyzing input data and generating outputs.
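The idea can be sketched on the CPU as follows. This is not the example program from the distribution, only an emulation under stated assumptions: each emulated thread starts at its global id and advances by the total number of threads, so that every pixel is processed exactly once. The integer luma weights, the choice to write the gray value into all three output channels, and the names toGray, fragmentProgramEmulated and runEmulated are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Integer luma approximation (weights 77/150/29 out of 256, roughly
// 0.299/0.587/0.114). The weights are an assumption of this sketch.
inline unsigned char toGray(unsigned char r, unsigned char g, unsigned char b) {
    return static_cast<unsigned char>((77u * r + 150u * g + 29u * b) >> 8);
}

// Work done by ONE thread: start at its global id and advance by the total
// number of threads, so every pixel is handled exactly once.
inline void fragmentProgramEmulated(const unsigned char* in, unsigned char* out,
                                    std::size_t pixelCount,
                                    unsigned threadIdx_x, unsigned blockIdx_x,
                                    unsigned blockDim_x, unsigned gridDim_x) {
    std::size_t first = std::size_t(blockIdx_x) * blockDim_x + threadIdx_x;
    std::size_t stride = std::size_t(gridDim_x) * blockDim_x;
    for (std::size_t p = first; p < pixelCount; p += stride) {
        unsigned char y = toGray(in[3 * p], in[3 * p + 1], in[3 * p + 2]);
        // Assumption: the gray value is stored in all three channels.
        out[3 * p] = out[3 * p + 1] = out[3 * p + 2] = y;
    }
}

// CPU stand-in for a launch: the GPU would run these threads in parallel,
// here they simply run one after another.
inline std::vector<unsigned char> runEmulated(const std::vector<unsigned char>& in,
                                              unsigned blockDim_x, unsigned gridDim_x) {
    assert(in.size() % 3 == 0 && blockDim_x > 0 && gridDim_x > 0);
    std::vector<unsigned char> out(in.size());
    for (unsigned b = 0; b < gridDim_x; ++b)
        for (unsigned t = 0; t < blockDim_x; ++t)
            fragmentProgramEmulated(in.data(), out.data(), in.size() / 3,
                                    t, b, blockDim_x, gridDim_x);
    return out;
}
```

Because the stride equals the total thread count, the result does not depend on how many blocks or threads are used, which is the property a real GPU program needs when the image size and the launch configuration do not match.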