CUDA: Wrapping device memory allocation in C++

I'm starting to use CUDA at the moment and have to admit that I'm a bit disappointed with the C API. I understand the reasons for choosing C, but if the API had been based on C++ instead, several aspects would have been a lot simpler, e.g. device memory allocation (via cudaMalloc).
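For reference, this is roughly what raw allocation looks like through the C API (a minimal sketch; error handling omitted, SIZE is just a placeholder):

float* device_data = NULL;
cudaMalloc(reinterpret_cast<void**>(&device_data), SIZE * sizeof(float));
// ... use device_data in kernels, copy with cudaMemcpy ...
cudaFree(device_data);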

My plan was to do this myself, with two alternatives: an overloaded placement operator new, and an RAII wrapper class. I'm wondering if there are any caveats that I haven't noticed so far. The code seems to work, but I'm still wondering about potential memory leaks.

The usage of the RAII code would be as follows:

CudaArray<float> device_data(SIZE);
// Use `device_data` as if it were a raw pointer.

Perhaps a class is overkill in this context (especially since you'd still have to use cudaMemcpy, with the class only encapsulating RAII), so the other approach would be placement new:

float* device_data = new (cudaDevice) float[SIZE];
// Use `device_data` …
operator delete [](device_data, cudaDevice);

Here, cudaDevice merely acts as a tag to trigger the overload. However, since in normal placement new the argument would indicate the placement location, I find the syntax oddly consistent and perhaps even preferable to using a class.

I'd appreciate criticism of every kind. Does somebody perhaps know whether something in this direction is planned for the next version of CUDA (which, as I've heard, will improve its C++ support, whatever they mean by that)?

So, my question is actually threefold:

  1. Is my placement new overload semantically correct? Does it leak memory?
  2. Does anybody have information about future CUDA developments that go in this general direction (let's face it: C interfaces in C++ s*ck)?
  3. How can I take this further in a consistent manner (there are other APIs to consider, e.g. there's not only device memory but also a constant memory store and texture memory)?

#include <cstddef>
#include <cuda_runtime.h>

// Singleton tag for CUDA device memory placement.
struct CudaDevice {
    static CudaDevice const& get() { return instance; }
private:
    static CudaDevice const instance;
    CudaDevice() { }
    CudaDevice(CudaDevice const&);
    CudaDevice& operator =(CudaDevice const&);
} const& cudaDevice = CudaDevice::get();

CudaDevice const CudaDevice::instance;

// Placement new[] overload: the CudaDevice tag selects this overload, and the
// memory comes from cudaMalloc. Note that the cudaMalloc return code is not
// checked here, so a failed allocation goes unnoticed.
inline void* operator new [](std::size_t nbytes, CudaDevice const&) {
    void* ret;
    cudaMalloc(&ret, nbytes);
    return ret;
}

// Matching placement delete[]: releases the device memory again.
inline void operator delete [](void* p, CudaDevice const&) throw() {
    cudaFree(p);
}

// RAII wrapper: allocates device memory in the constructor and releases it in
// the destructor; non-copyable.
template <typename T>
class CudaArray {
public:
    explicit
    CudaArray(std::size_t size) : size(size), data(new (cudaDevice) T[size]) { }

    operator T* () { return data; }

    ~CudaArray() {
        operator delete [](data, cudaDevice);
    }

private:
    std::size_t const size;
    T* const data;

    CudaArray(CudaArray const&);
    CudaArray& operator =(CudaArray const&);
};

About the singleton employed here: yes, I'm aware of its drawbacks. However, these aren't relevant in this context. All I needed was a small type tag that isn't copyable. Everything else (i.e. multithreading considerations, time of initialization) doesn't apply.
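For completeness, this is how I'd use the class together with cudaMemcpy (just a sketch; kernel launches and error checks omitted):

float host_data[SIZE];
// ... fill host_data ...

CudaArray<float> device_data(SIZE);

// The implicit conversion to T* lets the wrapper be passed directly to cudaMemcpy.
cudaMemcpy(device_data, host_data, SIZE * sizeof(float), cudaMemcpyHostToDevice);

// ... launch kernels that read/write device_data ...

cudaMemcpy(host_data, device_data, SIZE * sizeof(float), cudaMemcpyDeviceToHost);
// The device memory is released automatically when device_data goes out of scope.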

Asked by Konrad Rudolph, Nov 18 '08


3 Answers

In the meantime there have been some further developments (not so much in terms of the CUDA API itself, but at least in terms of projects attempting an STL-like approach to CUDA data management).

Most notably, there is a project from NVIDIA Research: Thrust.
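To give an idea of what that looks like, here is a small sketch using Thrust's containers (the sort is just an arbitrary example operation; compile as a .cu file with nvcc):

#include <cstddef>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

int main() {
    const std::size_t n = 1024;
    thrust::host_vector<float> h(n, 1.0f);        // host-side storage
    thrust::device_vector<float> d = h;           // allocates device memory and copies to it
    thrust::sort(d.begin(), d.end());             // runs on the GPU
    thrust::copy(d.begin(), d.end(), h.begin());  // copy the results back to the host
    return 0;                                     // d releases its device memory here
}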

Answered by kynan


I would go with the placement new approach. Then I would define a class that conforms to the std::allocator<> interface. In theory, you could pass this class as a template parameter into std::vector<> and std::map<> and so forth.

Beware, I have heard that doing such things is fraught with difficulty, but at least you will learn a lot more about the STL this way. And you do not need to re-invent your containers and algorithms.
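To sketch the idea (the class name is hypothetical and only the core members are shown; the tricky part is that standard containers will still construct and dereference elements through these pointers on the host, which is where the difficulty comes from):

#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical allocator whose storage comes from cudaMalloc. The remaining
// std::allocator<> boilerplate (rebind, construct, destroy, ...) is omitted;
// construct/destroy are the problematic parts, since they would touch device
// memory from host code.
template <typename T>
class cuda_allocator {
public:
    typedef T value_type;
    typedef T* pointer;
    typedef std::size_t size_type;

    pointer allocate(size_type n) {
        void* p = 0;
        cudaMalloc(&p, n * sizeof(T));
        return static_cast<pointer>(p);
    }

    void deallocate(pointer p, size_type /* n */) {
        cudaFree(p);
    }
};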

Answered by coryan


Does anybody have information about future CUDA developments that go in this general direction (let's face it: C interfaces in C++ s*ck)?

Yes, I've done something like that:

https://github.com/eyalroz/cuda-api-wrappers/

nVIDIA's Runtime API for CUDA is intended for use both in C and C++ code. As such, it uses a C-style API, the lowest common denominator (with a few notable exceptions of templated function overloads).

This library of wrappers around the Runtime API is intended to allow us to embrace many of the features of C++ (including some C++11) for using the runtime API - but without reducing expressivity or increasing the level of abstraction (as in, e.g., the Thrust library). Using cuda-api-wrappers, you still have your devices, streams, events and so on - but they will be more convenient to work with in more C++-idiomatic ways.
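Usage looks roughly like the sketch below (simplified; the header name and some function signatures have changed between releases, so treat the exact spellings as approximate and see the repository's examples for up-to-date usage):

#include <cuda/api_wrappers.hpp>   // header name differs in newer releases
#include <cstddef>
#include <vector>

int main() {
    const std::size_t n = 1024;
    std::vector<float> h_data(n, 1.0f);

    auto device = cuda::device::current::get();

    // unique_ptr-style owner of device memory, freed automatically
    auto d_data = cuda::memory::device::make_unique<float[]>(device, n);

    cuda::memory::copy(d_data.get(), h_data.data(), n * sizeof(float));
    // ... launch kernels, copy results back, etc. ...
    return 0;
}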

Answered by einpoklum