YARP  2.3.68+228-20170410.2+git7d0b2e0
Yet Another Robot Platform
Handling NVIDIA GPUs with YARP
Author
Giacomo Spigler

Introduction

In the last years graphics cards became more and more powerful, and eventually they reached computing speeds of the order of hundreds of gigaflops. Graphics Processing Units, or GPUs are dedicated processors, cheap enough to be affordable by everyone and powerful enough to be able to replace clusters of tens of modern computers. YARP has now two new modules, which come with the main distribution (starting from the latest releases): CUDA and NVIDIA. This documentation only regards CUDA, even thought it is compatible only with the latest Nvidia chipsets. The other driver is slower and worse optimized. These new modules rely on the new IGPUInterface class, which can be easily found into libYARP_dev's includes.

Example: YARP application

Programming with the CUDA module is quite easy, but there are some requirements. Let's start with an example. The full code of this example can be found under the example/cuda directory. First of all, make sure to include the following headers into your application:

using namespace yarp::dev;

Then you can proceed creating an instance to the driver's object:

Property config;
// no arguments, use a default
char str[80];
sprintf(str, "(device cuda) (w %d) (h %d) (bpp 3)", img.width(), img.height());
config.fromString(str);
PolyDriver dd(config);
if (!dd.isValid()) {
printf("Failed to create and configure a device\n");
exit(1);
}
IGPUDevice *gpu;
if (!dd.view(gpu)) {
printf("Failed to view device through IGPUDevice interface\n");
exit(1);
}

Example: GPU program

Now you have a working CUDA driver set up and configured. However, to start using it, you have to load the programs you want to execute on your data. Programs are in cubin format, which is a kind of ascii executable and that the GPU can understand. We will see later on how to compile custom programs. To load them, just do:

int prog = gpu->load("progs/bw.cubin");

Then you can finally execute it. Processing on data is accomplished either with c arrays of elements or YARP images. There are two overloaded methods CUDA has to handle it, IGPUInterface::execute, which can handle "unsigned char *" or "ImageOf<PixelRgb>". Support for floats might come soon.

gpu->execute(prog, &img, &out);

At the end of the program, just put:

dd.close();

Now you have a YARP application, but you need some programs to run on the GPU. Program's structure has always to be the same (for more information I'd suggest the CUDA programming manual). Following, it is an example program (to convert RGB images to BW):

extern "C" {
__global__ void FragmentProgram(int w, int h, unsigned char *in, unsigned char *out) {
int i;
for(i=threadIdx.x+blockIdx.x; i<w*h; i+=blockDim.x*gridDim.x) {
out[i*3]=(in[i*3]+in[i*3+1]+in[i*3+2])/3;
out[i*3+1]=out[i*3];
out[i*3+2]=out[i*3];
}
}
}

Let's look at it. Just a quick note, the software automatically defines some objects, namely threadIdx, blockIdx, blockDim and gridDim. At the moment, for ease of usage, they are just used 1D, accessed as threadIdx.x. The first one is the id of the thread which executes the main function (within its current block); the second one is the id of the block within the block grid. blockDim is the numbe of threads contained in a single block, and gridDim is the size of the block grid.

Then, the main function (at the moment just use the main function) has to be named FragmentProgram, and to have a "__global__" before its type. Type is not important, so you might always set it to void. The FragmentProgram function will run on every GPU's core, executing the same instructions (SIMD approach), but every thread will have its own threadIdx.x identifier. At last we have to compile the GPU program into the binary .cubin object:

nvcc -cubin bw.cu

Using this information, and knowing the number of threads currently running, you can write parallel code to run on the GPU, analyzing input data and generating outputs.