Accessing class data members from within a CUDA kernel - how to design proper host/device interaction?

Tags: c++, oop, cuda

I've been trying to transform some CUDA/C code into a more object-oriented design, but the goal doesn't seem easy to achieve with my current understanding of how CUDA works, and I haven't been able to find a good explanation of this situation either. It might not be possible after all.

I have a global object of class myClass holding an array to be filled in a kernel.

How should the methods in myClass be defined so that the array and boolean members are visible from the device, and the array can then be copied back to the host? I am using CUDA 7.5 and my card has compute capability 3.5.

This is a tentative structure describing the situation:

#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

class myClass
{
public:
        bool bool_var;    // Set from host and readable from device
        int  data_size;   // Set from host
        __device__ __host__ myClass();
        __device__ __host__ ~myClass();
        __host__ void setValues(bool iftrue, int size);
        __device__ void dosomething(int device_parameter);
        __host__ void export();

        // completely unknown methods
        __host__ void prepareDeviceObj();
        __host__ void retrieveDataToHost();
private:
        int *data; // Filled in device, shared between threads, at the end copied back to host for data output
};

__host__ __device__ myClass::myClass()
{
}

__host__ __device__ myClass::~myClass()
{
#ifdef __CUDACC__
        if(bool_var)
                cudaFree(data);
#else
        free(data);
#endif
}

__host__ void myClass::setValues(bool iftrue, int size)
{
        bool_var  = iftrue;
        data_size = size;
}

__device__ void myClass::dosomething(int idx)
{
        int toadd = idx+data_size;
        atomicAdd(&data[idx], toadd); // data should be unique among threads
}


__global__ void myKernel(myClass obj)
{
        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < obj.data_size)
        {
                if(!obj.bool_var)
                        printf("Object is not up to any task here!");
                else
                {
                        printf("Object is ready!");
                        obj.dosomething(idx);
                }
        }
}


myClass globalInstance;

int main(int argc, char** argv)
{
        int some_number = 40;
        globalInstance.setValues(true, some_number);
        globalInstance.prepareDeviceObj();           // unknown
        myKernel<<<1,some_number>>>(globalInstance); // how to pass the object?
        globalInstance.retrieveDataToHost();         // unknown
        globalInstance.export();
        exit(EXIT_SUCCESS);
}
asked Aug 17 '16 by learningProcess

People also ask

Which is the correct way to launch a CUDA kernel?

In order to run a kernel on the CUDA threads, we need two things. First, in the main() function of the program, we call the function to be executed by each thread on the GPU. This invocation is called a kernel launch, and with it we need to provide the number of threads and their grouping.
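
For example, a minimal launch might look like this (the kernel scaleKernel, its sizes, and the device pointer d_v are illustrative, not taken from the question):

#include <cuda_runtime.h>

__global__ void scaleKernel(float *v, int n)
{
        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < n)
                v[idx] *= 2.0f;   // each thread handles one element
}

void launchScale(float *d_v, int n)
{
        const int threads = 256;                        // threads per block
        const int blocks  = (n + threads - 1)/threads;  // enough blocks to cover n
        scaleKernel<<<blocks, threads>>>(d_v, n);       // <<<grid, block>>> launch syntax
}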

What is host and device in CUDA?

In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and the device; it also launches kernels, which are functions executed on the device.
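
As a small sketch of that split (the buffer names h_buf/d_buf are made up for illustration):

#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

void copyRoundTrip(int n)
{
        size_t bytes = n*sizeof(float);
        float *h_buf = (float *)malloc(bytes);   // host memory, owned by the CPU
        memset(h_buf, 0, bytes);
        float *d_buf;
        cudaMalloc(&d_buf, bytes);               // device memory, allocated from host code
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice); // host -> device
        // ... launch kernels that operate on d_buf here ...
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost); // device -> host
        cudaFree(d_buf);
        free(h_buf);
}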

What is kernel in association with CUDA programming?

CUDA blocks are grouped into a grid, and a kernel is executed as a grid of blocks of threads. Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism).
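
The standard idiom for a thread to locate itself within a 1D grid (the same pattern the code below uses):

__global__ void indexDemo(int *out, int n)
{
        // global index = this block's offset plus the thread's offset within it
        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < n)        // guard: the last block may be only partially full
                out[idx] = idx;
}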


2 Answers

Your approach should be workable. When you pass an object by value as a kernel parameter (as you have indicated), there really isn't much setup needed for the transfer from host to device.

You need to properly allocate data on the host and the device, and use cudaMemcpy type operations at appropriate points to move the data, just as you would in an ordinary CUDA program.

One thing to be aware of when declaring an object at global scope, as you have done, is that it is recommended not to use CUDA API calls in the object's constructor or destructor. The reasons are covered here; I won't repeat them. Although that treatment mostly focuses on kernels launched before main, CUDA lazy initialization can also impact any CUDA API call executed outside of main scope, which applies to constructors and destructors of objects instantiated at global scope.
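
One hedged sketch of how to respect that restriction: keep the constructor and destructor trivial and do the CUDA work in explicit methods called from main (the init/shutdown names here are my own, not a CUDA convention):

#include <cuda_runtime.h>

class deviceBuffer
{
public:
        deviceBuffer() : d_data(0), len(0) {}   // no CUDA API calls here
        ~deviceBuffer() {}                      // and none here either
        void init(int n)                        // called explicitly from main
        {
                len = n;
                cudaMalloc(&d_data, len*sizeof(int));
        }
        void shutdown()                         // called from main before exit
        {
                cudaFree(d_data);
                d_data = 0;
        }
private:
        int *d_data;
        int  len;
};

deviceBuffer globalBuf;   // safe at global scope: the constructor does no CUDA work

int main()
{
        globalBuf.init(40);
        // ... use globalBuf with kernels ...
        globalBuf.shutdown();
        return 0;
}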

What follows is a fleshed-out example based on what you have shown. I mostly didn't change the code you had already written, just added definitions for the methods you hadn't written. (Note that I renamed export to export_data: export is a reserved word in C++, so it can't be used as a method name.) There are obviously a lot of different possible approaches here. For more examples you might want to look at the CUDA C++ integration sample code.

Here's a worked example around what you have shown:

$ cat t1236.cu
#include <cstdio>

class myClass
{
public:
        bool bool_var;    // Set from host and readable from device
        int  data_size;   // Set from host
        __host__ myClass();
        __host__ ~myClass();
        __host__ void setValues(bool iftrue, int size);
        __device__ void dosomething(int device_parameter);
        __host__ void export_data();

        // completely unknown methods
        __host__ void prepareDeviceObj();
        __host__ void retrieveDataToHost();
private:
        int *data; // Filled in device, shared between threads, at the end copied back to host for data output
        int *h_data;
};

__host__ myClass::myClass()
{
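        // intentionally empty: no CUDA API calls in ctors of global-scope objects (see above)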
}

__host__ myClass::~myClass()
{
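        // likewise empty; buffers are freed in export_data(), called from main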
}

__host__ void myClass::prepareDeviceObj(){
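        // copy the host staging buffer to the device allocation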
        cudaMemcpy(data, h_data, data_size*sizeof(h_data[0]), cudaMemcpyHostToDevice);
}
__host__ void myClass::retrieveDataToHost(){
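        // copy device results back into the host staging buffer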
        cudaMemcpy(h_data, data, data_size*sizeof(h_data[0]), cudaMemcpyDeviceToHost);
}

__host__ void myClass::setValues(bool iftrue, int size)
{
        bool_var  = iftrue;
        data_size = size;
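        // allocate the device buffer plus a zero-filled host staging buffer of the same size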
        cudaMalloc(&data, data_size*sizeof(data[0]));
        h_data = (int *)malloc(data_size*sizeof(h_data[0]));
        memset(h_data, 0, data_size*sizeof(h_data[0]));
}

__device__ void myClass::dosomething(int idx)
{
        int toadd = idx+data_size;
        atomicAdd(&(data[idx]), toadd); // data should be unique among threads
}
__host__ void myClass::export_data(){
        for (int i = 0; i < data_size; i++) printf("%d ", h_data[i]);
        printf("\n");
        cudaFree(data);
        free(h_data);
}


__global__ void myKernel(myClass obj)
{
        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < obj.data_size)
        {
                if(!obj.bool_var)
                        printf("Object is not up to any task here!");
                else
                {
                        //printf("Object is ready!");
                        obj.dosomething(idx);
                }
        }
}


myClass globalInstance;

int main(int argc, char** argv)
{
        int some_number = 40;
        globalInstance.setValues(true, some_number);
        globalInstance.prepareDeviceObj();
        myKernel<<<1,some_number>>>(globalInstance);
        globalInstance.retrieveDataToHost();
        globalInstance.export_data();
        exit(EXIT_SUCCESS);
}
$ nvcc -o t1236 t1236.cu
$ cuda-memcheck ./t1236
========= CUDA-MEMCHECK
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
========= ERROR SUMMARY: 0 errors
$
answered Nov 11 '22 by Robert Crovella

What has worked best for me is to put only regular CUDA functions, kernels, and kernel launches in CUDA C (.cu) files, and then build an object-oriented interface on top of that, with classes in C++ (.cpp) files.

So, in your class constructor, you call functions in your .cu file that allocate and initialize memory, and in your methods, you call functions that launch kernels, etc. (see the sketch below).

This can also make development turnaround faster, because you can often change your classes without incurring a recompile of the .cu files, which is much slower than compiling pure .cpp files.
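
A minimal sketch of that layout (the names fillKernel, fill_on_gpu, and GpuArray are hypothetical, just to show the split):

// gpu_ops.cu -- only this file needs to be compiled with nvcc
#include <cuda_runtime.h>

__global__ void fillKernel(int *d, int n)
{
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < n) d[idx] = idx;
}

void fill_on_gpu(int *h_out, int n)   // plain function wrapping the CUDA work
{
        int *d;
        cudaMalloc(&d, n*sizeof(int));
        fillKernel<<<(n + 255)/256, 256>>>(d, n);
        cudaMemcpy(h_out, d, n*sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d);
}

// gpu_array.cpp -- ordinary C++, compiled by the host compiler
void fill_on_gpu(int *h_out, int n);  // declaration only; definition lives in the .cu file

class GpuArray
{
public:
        explicit GpuArray(int n) : len(n), data(new int[n])
        {
                fill_on_gpu(data, len);   // all CUDA work happens behind this call
        }
        ~GpuArray() { delete [] data; }
private:
        int  len;
        int *data;
};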

answered Nov 11 '22 by Roger Dahl