How to implement handles for a CUDA driver API library?

Tags:

^{Note: The question has been updated to address the questions that have been raised in the comments, and to emphasize that the core of the question is about the interdependencies between the Runtime- and Driver API}

The CUDA runtime libraries (like CUBLAS or CUFFT) are generally using the concept of a "handle" that summarizes the state and context of such a library. The usage pattern is quite simple:

// Create a handle
cublasHandle_t handle;
cublasCreate(&handle);

// Call some functions, always passing in the handle as the first argument
cublasSscal(handle, ...);

// When done, destroy the handle
cublasDestroy(handle);

However, there are many subtle details about how these handles interoperate with Driver- and Runtime contexts and multiple threads and devices. The documentation lists several, scattered details about context handling:

The general description of contexts in the CUDA Programming Guide at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#context
The handling of multiple contexts, as described in the CUDA Best Practices Guide at http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#multiple-contexts
The context management differences between runtime and driver API, explained at http://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html
The general description of CUBLAS contexts/handles at http://docs.nvidia.com/cuda/cublas/index.html#cublas-context and their thread safety at http://docs.nvidia.com/cuda/cublas/index.html#thread-safety2

However, some of information seems to be not entirely up to date (for example, I think one should use cuCtxSetCurrent instead of cuCtxPushCurrent and cuCtxPopCurrent?), some of it seems to be from a time before the "Primary Context" handling was exposed via the driver API, and some parts are oversimplified in that they only show the most simple usage patterns, make only vague or incomplete statements about multithreading, or cannot be applied to the concept of "handles" that is used in the runtime libraries.

My goal is to implement a runtime library that offers its own "handle" type, and that allows usage patterns that are equivalent to the other runtime libraries in terms of context handling and thread safety.

For the case that the library can internally be implemented solely using the Runtime API, things may be clear: The context management is solely in the responsibility of the user. If he creates an own driver context, the rules that are stated in the documentation about the Runtime- and Driver context management will apply. Otherwise, the Runtime API functions will take care of the handling of primary contexts.

However, there may be the case that a library will internally have to use the Driver API. For example, in order to load PTX files as CUmodule objects, and obtain the CUfunction objects from them. And when the library should - for the user - behave like a Runtime library, but internally has to use the Driver API, some questions arise about how the context handling has to be implemented "under the hood".

What I have figured out so far is sketched here.

_{(It is "pseudocode" in that it omits the error checks and other details, and ... all this is supposed to be implemented in Java, but that should not be relevant here)}

1. The "Handle" is basically a class/struct containing the following information:

class Handle 
{
    CUcontext context;
    boolean usingPrimaryContext;
    CUdevice device;
}

2. When it is created, two cases have to be covered: It can be created when a driver context is current for the calling thread. In this case, it should use this context. Otherwise, it should use the primary context of the current (runtime) device:

Handle createHandle()
{
    cuInit(0);

    // Obtain the current context
    CUcontext context;
    cuCtxGetCurrent(&context);

    CUdevice device;

    // If there is no context, use the primary context
    boolean usingPrimaryContext = false;
    if (context == nullptr)
    {
        usingPrimaryContext = true;

        // Obtain the device that is currently selected via the runtime API
        int deviceIndex;
        cudaGetDevice(&deviceIndex);

        // Obtain the device and its primary context
        cuDeviceGet(&device, deviceIndex);
        cuDevicePrimaryCtxRetain(&context, device));
        cuCtxSetCurrent(context);
    }
    else
    {
        cuCtxGetDevice(device);
    }

    // Create the actual handle. This might internally allocate
    // memory or do other things that are specific for the context
    // for which the handle is created
    Handle handle = new Handle(device, context, usingPrimaryContext);
    return handle;
}

3. When invoking a kernel of the library, the context of the associated handle is made current for the calling thread:

void someLibraryFunction(Handle handle)
{
    cuCtxSetCurrent(handle.context);
    callMyKernel(...);
}

Here, one could argue that the caller is responsible for making sure that the required context is current. But if the handle was created for a primary context, then this context will be made current automatically.

4. When the handle is destroyed, this means that cuDevicePrimaryCtxRelease has to be called, but only when the context is a primary context:

void destroyHandle(Handle handle)
{
    if (handle.usingPrimaryContext)
    {
        cuDevicePrimaryCtxRelease(handle.device);
    }
}

From my experiments so far, this seems to expose the same behavior as a CUBLAS handle, for example. But my possibilities for thoroughly testing this are limited, because I only have a single device, and thus cannot test the crucial cases, e.g. of having two contexts, one for each of two devices.

So my questions are:

Are there any established patterns for implementing such a "Handle"?
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?

(I also had a look at the context handling in tensorflow, but I'm not sure whether one can derive recommendations about how to implement handles for a runtime library from that...)

^{(An "Update" has been removed here, because it was added in response to the comments, and should no longer be relevant)}

362

asked Feb 08 '18 01:02

Marco13

1 Answers

_{I'm sorry I hadn't noticed this question sooner - as we might have collaborated on this somewhat. Also, it's not quite clear to me whether this question belongs here, on codereview.SX or on programmers.SX, but let's ignore all that.}

I have now done what you were aiming to do, and possibly more generally. So, I can offer both an example of what to do with "handles", and moreover, suggest the prospect of not having to implement this at all.

The library is an expanding of cuda-api-wrappers to also cover the Driver API and NVRTC; it is not yet release-grade, but it is in the testing phase, on this branch.

Now, to answer your concrete question:

Pattern for writing a class surrounding a raw "handle"

Are there any established patterns for implementing such a "Handle"?

Yes. If you read:

What is the difference between: Handle, Pointer and Reference

you'll notice a handle is defined as an "opaque reference to an object". It has some similarity to a pointer. A relevant pattern, therefore, is a variation on the PIMPL idiom: In regular PIMPL, you write an implementation class, and the outwards-facing class only holds a pointer to the implementation class and forwards method calls to it. When you have an opaque handle to an opaque object in some third-party library or driver - you use the handle to forward method calls to that implementation.

That means, that your outwards-facing class is not a handle, it represents the object to which you have a handle.

Generality and flexibility

Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?

I'm not sure what exactly CUBLAS does under the hood (and I have almost never used CUBLAS to be honest), but if it were well-designed and implemented, it would create its own context, and try to not to impinge on the rest of your code, i.e. it would alwas do:

Push our CUBLAS context onto the top of the stack
Do actual work
Pop the top of the context stack.

Your class doesn't do this.

More generally: Are there any recommendations of how to improve the current "Handle" implementation?

Yes:

Use RAII whenever it is possible and relevant. If your creation code allocates a resource (e.g. via the CUDA driver) - the destructor for the object you return should safely release those resources.
Allow for both reference-type and value-type use of Handles, i.e. it may be the handle I created, but it may also be a handle I got from somewhere else and isn't my responsibility. This is trivial if you leave it up to the user to release resources, but a bit tricky if you take that responsibility
You assume that if there's any current context, that's the one your handle needs to use. Says who? At the very least, let the user pass a context in if they want to.
Avoid writing the low-level parts of this on your own unless you really must. You are quite likely to miss some things (the push-and-pop is not the only thing you might be missing), and you're repeating a lot of work that is actually generic and not specific to your application or library. I may be biased here, but you can now use nice, RAII-ish, wrappers for CUDA contexts, streams, modules, devices etc. without even known about raw handles for anything.

Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?

To the best of my knowledge, NVIDIA hasn't released it.

103

answered Oct 09 '22 17:10

einpoklum

Related questions
                            
                                Problems when running nvcc from command line
                            
                                Matrix multiplication on CPU (numpy) and GPU (gnumpy) give different results
                            
                                How is 2D Shared Memory arranged in CUDA
                            
                                CUDA allocate memory in __device__ function
                            
                                How to run CUDA without a GPU using a software implementation?
                            
                                How to Run a cuda code using remote Desktop?
                            
                                CUDA version X complains about not supporting gcc version Y - what to do?
                            
                                CUDA exp() expf() and __expf()
                            
                                How do I enable syntax highlighting of CUDA .cu files in Visual Studio 2010?
                            
                                Installing cuda 5 samples in Ubuntu 12.10
                            
                                Block reduction in CUDA
                            
                                Cuda Image average filter
                            
                                How to enable syntax highlighting for CUDA .cu and .cuh files in Vim?
                            
                                Random Number Generator in CUDA
                            
                                Error in cudaMemcpyToSymbol using CUDA 5
                            
                                Is there a way of setting default value for shared memory array?
                            
                                Cuda C - Linker error - undefined reference
                            
                                Accessing OpenCV CUDA Functions from Python (No PyCUDA)
                            
                                How to call a host function in a CUDA kernel?
                            
                                128 bit integer on cuda?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to implement handles for a CUDA driver API library?

Tags:

cuda

api-design

cuda-context

jcuda