Accessing class data members from within a CUDA kernel - how to design proper host/device interaction?

Tags: c++, oop, cuda

I've been trying to transform some CUDA/C code into a more object-oriented design, but the goal doesn't seem easy to achieve with my current understanding of how CUDA works, and I haven't been able to find a good explanation of this situation either. It might not be possible after all.

I have a global object of class myClass holding an array to be filled in a kernel.

How should the methods in myClass be defined so that the array and boolean members are visible from the device, and the array can then be copied back to the host? I am using CUDA 7.5 and my card has compute capability 3.5.

This is a tentative structure describing the situation:

#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

class myClass
{
public:
        bool bool_var;    // Set from host and readable from device
        int  data_size;   // Set from host
        __device__ __host__ myClass();
        __device__ __host__ ~myClass();
        __host__ void setValues(bool iftrue, int size);
        __device__ void dosomething(int device_parameter);
        __host__ void export();

        // completely unknown methods
        __host__ void prepareDeviceObj();
        __host__ void retrieveDataToHost();
private:
        int *data; // Filled in device, shared between threads, at the end copied back to host for data output
};

__host__ __device__ myClass::myClass()
{
}

__host__ __device__ myClass::~myClass()
{
#ifdef __CUDACC__
        if(bool_var)
                cudaFree(data);
#else
        free(data);
#endif
}

__host__ void myClass::setValues(bool iftrue, int size)
{
        bool_var  = iftrue;
        data_size = size;
}

__device__ void myClass::dosomething(int idx)
{
        int toadd = idx+data_size;
        atomicAdd(&data[idx], toadd); // data should be unique among threads
}


__global__ void myKernel(myClass obj)
{
        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < obj.data_size)
        {
                if(!obj.bool_var)
                        printf("Object is not up to any task here!");
                else
                {
                        printf("Object is ready!");
                        obj.dosomething(idx);
                }
        }
}


myClass globalInstance;

int main(int argc, char** argv)
{
        int some_number = 40;
        globalInstance.setValues(true, some_number);
        globalInstance.prepareDeviceObj();           // unknown
        myKernel<<<1,some_number>>>(globalInstance); // how to pass the object?
        globalInstance.retrieveDataToHost();         // unknown
        globalInstance.export();
        exit(EXIT_SUCCESS);
}
asked Aug 17 '16 by learningProcess

People also ask

Which is the correct way to launch a CUDA kernel?

In order to run a kernel on the CUDA threads, we need two things. First, in the main() function of the program, we call the function to be executed by each thread on the GPU. This invocation is called a kernel launch, and with it we need to provide the number of threads and their grouping.
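
For example, a minimal launch might look like this (the kernel scaleKernel, its sizes, and the device pointer d_v are illustrative, not taken from the question):

#include <cuda_runtime.h>

__global__ void scaleKernel(float *v, int n)
{
        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < n)
                v[idx] *= 2.0f;   // each thread handles one element
}

void launchScale(float *d_v, int n)
{
        const int threads = 256;                        // threads per block
        const int blocks  = (n + threads - 1)/threads;  // enough blocks to cover n
        scaleKernel<<<blocks, threads>>>(d_v, n);       // <<<grid, block>>> launch syntax
}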

What is host and device in CUDA?

In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and the device; it also launches kernels, which are functions executed on the device.
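
As a small sketch of that split (the buffer names h_buf/d_buf are made up for illustration):

#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

void copyRoundTrip(int n)
{
        size_t bytes = n*sizeof(float);
        float *h_buf = (float *)malloc(bytes);   // host memory, owned by the CPU
        memset(h_buf, 0, bytes);
        float *d_buf;
        cudaMalloc(&d_buf, bytes);               // device memory, allocated from host code
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice); // host -> device
        // ... launch kernels that operate on d_buf here ...
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost); // device -> host
        cudaFree(d_buf);
        free(h_buf);
}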

What is kernel in association with CUDA programming?

CUDA blocks are grouped into a grid, and a kernel is executed as a grid of blocks of threads. Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism).
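
The standard idiom for a thread to locate itself within a 1D grid (the same pattern the code below uses):

__global__ void indexDemo(int *out, int n)
{
        // global index = this block's offset plus the thread's offset within it
        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < n)        // guard: the last block may be only partially full
                out[idx] = idx;
}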


2 Answers

Your approach should be workable. When you pass an object by value as a kernel parameter (as you have indicated), there really isn't much setup needed for the transfer from host to device.

You need to properly allocate data on the host and the device, and use cudaMemcpy type operations at appropriate points to move the data, just as you would in an ordinary CUDA program.

One thing to be aware of when declaring an object at global scope, as you have done, is that it is recommended not to use CUDA API calls in the object's constructor or destructor. The reasons are covered here; I won't repeat them. Although that treatment mostly focuses on kernels launched before main, CUDA lazy initialization can also impact any CUDA API call executed outside of main scope, which applies to constructors and destructors of objects instantiated at global scope.
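
One hedged sketch of how to respect that restriction: keep the constructor and destructor trivial and do the CUDA work in explicit methods called from main (the init/shutdown names here are my own, not a CUDA convention):

#include <cuda_runtime.h>

class deviceBuffer
{
public:
        deviceBuffer() : d_data(0), len(0) {}   // no CUDA API calls here
        ~deviceBuffer() {}                      // and none here either
        void init(int n)                        // called explicitly from main
        {
                len = n;
                cudaMalloc(&d_data, len*sizeof(int));
        }
        void shutdown()                         // called from main before exit
        {
                cudaFree(d_data);
                d_data = 0;
        }
private:
        int *d_data;
        int  len;
};

deviceBuffer globalBuf;   // safe at global scope: the constructor does no CUDA work

int main()
{
        globalBuf.init(40);
        // ... use globalBuf with kernels ...
        globalBuf.shutdown();
        return 0;
}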

What follows is a fleshed-out example based on what you have shown. I mostly didn't change the code you had already written, just added definitions for the methods you hadn't written. (Note that I renamed export to export_data: export is a reserved word in C++, so it can't be used as a method name.) There are obviously a lot of different possible approaches here. For more examples you might want to look at the CUDA C++ integration sample code.

Here's a worked example around what you have shown:

$ cat t1236.cu
#include <cstdio>

class myClass
{
public:
        bool bool_var;    // Set from host and readable from device
        int  data_size;   // Set from host
        __host__ myClass();
        __host__ ~myClass();
        __host__ void setValues(bool iftrue, int size);
        __device__ void dosomething(int device_parameter);
        __host__ void export_data();

        // completely unknown methods
        __host__ void prepareDeviceObj();
        __host__ void retrieveDataToHost();
private:
        int *data; // Filled in device, shared between threads, at the end copied back to host for data output
        int *h_data;
};

__host__ myClass::myClass()
{
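        // intentionally empty: no CUDA API calls in ctors of global-scope objects (see above)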
}

__host__ myClass::~myClass()
{
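        // likewise empty; buffers are freed in export_data(), called from main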
}

__host__ void myClass::prepareDeviceObj(){
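        // copy the host staging buffer to the device allocation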
        cudaMemcpy(data, h_data, data_size*sizeof(h_data[0]), cudaMemcpyHostToDevice);
}
__host__ void myClass::retrieveDataToHost(){
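        // copy device results back into the host staging buffer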
        cudaMemcpy(h_data, data, data_size*sizeof(h_data[0]), cudaMemcpyDeviceToHost);
}

__host__ void myClass::setValues(bool iftrue, int size)
{
        bool_var  = iftrue;
        data_size = size;
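        // allocate the device buffer plus a zero-filled host staging buffer of the same size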
        cudaMalloc(&data, data_size*sizeof(data[0]));
        h_data = (int *)malloc(data_size*sizeof(h_data[0]));
        memset(h_data, 0, data_size*sizeof(h_data[0]));
}

__device__ void myClass::dosomething(int idx)
{
        int toadd = idx+data_size;
        atomicAdd(&(data[idx]), toadd); // data should be unique among threads
}
__host__ void myClass::export_data(){
        for (int i = 0; i < data_size; i++) printf("%d ", h_data[i]);
        printf("\n");
        cudaFree(data);
        free(h_data);
}


__global__ void myKernel(myClass obj)
{
        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < obj.data_size)
        {
                if(!obj.bool_var)
                        printf("Object is not up to any task here!");
                else
                {
                        //printf("Object is ready!");
                        obj.dosomething(idx);
                }
        }
}


myClass globalInstance;

int main(int argc, char** argv)
{
        int some_number = 40;
        globalInstance.setValues(true, some_number);
        globalInstance.prepareDeviceObj();
        myKernel<<<1,some_number>>>(globalInstance);
        globalInstance.retrieveDataToHost();
        globalInstance.export_data();
        exit(EXIT_SUCCESS);
}
$ nvcc -o t1236 t1236.cu
$ cuda-memcheck ./t1236
========= CUDA-MEMCHECK
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
========= ERROR SUMMARY: 0 errors
$
answered Nov 11 '22 by Robert Crovella

What has worked best for me is to put only regular CUDA functions, kernels, and kernel launches in CUDA C (.cu) files, and then build an object-oriented interface on top of that, with classes in C++ (.cpp) files.

So, in your class constructor, you call functions in your .cu file that allocate and initialize memory, and in your methods, you call functions that launch kernels, etc. (see the sketch below).

This can also make development turnaround faster, because you can often change your classes without incurring a recompile of the .cu files, which is much slower than compiling pure .cpp files.
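
A minimal sketch of that layout (the names fillKernel, fill_on_gpu, and GpuArray are hypothetical, just to show the split):

// gpu_ops.cu -- only this file needs to be compiled with nvcc
#include <cuda_runtime.h>

__global__ void fillKernel(int *d, int n)
{
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < n) d[idx] = idx;
}

void fill_on_gpu(int *h_out, int n)   // plain function wrapping the CUDA work
{
        int *d;
        cudaMalloc(&d, n*sizeof(int));
        fillKernel<<<(n + 255)/256, 256>>>(d, n);
        cudaMemcpy(h_out, d, n*sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d);
}

// gpu_array.cpp -- ordinary C++, compiled by the host compiler
void fill_on_gpu(int *h_out, int n);  // declaration only; definition lives in the .cu file

class GpuArray
{
public:
        explicit GpuArray(int n) : len(n), data(new int[n])
        {
                fill_on_gpu(data, len);   // all CUDA work happens behind this call
        }
        ~GpuArray() { delete [] data; }
private:
        int  len;
        int *data;
};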

answered Nov 11 '22 by Roger Dahl