Let's say I want a CUDA kernel that needs to do lots of stuff, but there are some parameters that are constant across all the kernel launches. These arguments are passed to the main program as input, so they cannot be defined in a #define.
The kernel will run multiple times (around 65K launches) and it needs those parameters (and some other inputs) to do its math.
My question is: what's the fastest (or else, the most elegant) way of passing these constants to the kernels?
The constants are float* or int* arrays of 2 or 3 elements each. There will be around 5~10 of them.
Toy example with 2 constants, const1 and const2:
__global__ void kernelToyExample(int inputdata, ?????)
{
    float value = inputdata * const1[0] + const2[1] / const1[2];
}
Is it better to do
__global__ void kernelToyExample(int inputdata, float* const1, float* const2)
{
    float value = inputdata * const1[0] + const2[1] / const1[2];
}
or
__global__ void kernelToyExample(int inputdata, float const1x, float const1y, float const1z, float const2x, float const2y)
{
    float value = inputdata * const1x + const2y / const1z;
}
Or maybe declare them in some global read-only memory and let the kernels read from there? If so, L1, L2, global? Which one?
Is there a better way I don't know of?
Running on a Tesla K40.
Just pass them by value. The compiler will automagically put them in the optimal place to facilitate cached broadcast to all threads in each block: either shared memory on compute capability 1.x devices, or constant memory/constant cache on compute capability >= 2.0 devices.
For example, if you had a long list of arguments to pass to the kernel, a struct passed by value is a clean way to go:
struct arglist {
float magicfloat_1;
float magicfloat_2;
//......
float magicfloat_19;
int magicint1;
//......
};
__global__ void kernel(...., const arglist args)
{
// you get the idea
}
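To make the pass-by-value idea concrete, here is a minimal sketch of filling such a struct on the host and handing it to a kernel at launch. The field names, launch configuration, and the arithmetic inside the kernel are illustrative assumptions, not from the original post:

```cuda
#include <cstdio>

struct arglist {
    float magicfloat_1;
    float magicfloat_2;
    int   magicint1;
};

__global__ void kernel(float *out, const arglist args)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread reads the same struct members, so the hardware
    // can broadcast them through the constant cache.
    out[i] = i * args.magicfloat_1 + args.magicfloat_2 / args.magicint1;
}

int main()
{
    arglist args = {2.0f, 8.0f, 4};   // filled once on the host, e.g. from argv
    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    kernel<<<1, 256>>>(d_out, args);  // struct is copied by value at launch
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Because the struct travels in the kernel argument buffer, no explicit cudaMemcpy or cudaMemcpyToSymbol is needed for the constants themselves.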
[standard disclaimer: written in browser, not real code, caveat emptor]
If it turned out that one of your magicint values actually only took one of a small number of values which you know beforehand, then templating is an extremely powerful tool:
template<int magiconstant1>
__global__ void kernel(....)
{
    for(int i = 0; i < magiconstant1; ++i) {
        // .....
    }
}

template __global__ void kernel<3>(....);
template __global__ void kernel<4>(....);
template __global__ void kernel<5>(....);
The compiler is smart enough to recognise that magiconstant1 makes the loop trip count known at compile time and will automatically unroll the loop for you. Templating is a very powerful technique for building fast, flexible codebases, and you would be well advised to accustom yourself with it if you haven't already done so.
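Since the template parameter must be known at compile time, the runtime value has to be mapped to a compile-time one explicitly on the host. A minimal sketch of that dispatch, where the kernel body, launch configuration, and the helper name launch are assumptions for illustration:

```cuda
template<int magiconstant1>
__global__ void kernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < magiconstant1; ++k)   // trip count known at compile time,
        acc += out[i] * k;                    // so the compiler can fully unroll
    out[i] = acc;
}

// Hypothetical host-side dispatcher: map the runtime value 'n'
// onto the small set of instantiations known beforehand.
void launch(float *d_out, int n)
{
    switch (n) {
        case 3: kernel<3><<<1, 256>>>(d_out); break;
        case 4: kernel<4><<<1, 256>>>(d_out); break;
        case 5: kernel<5><<<1, 256>>>(d_out); break;
        default: /* handle unsupported values */ break;
    }
}
```

The cost of this pattern is one compiled kernel per supported value, which is fine when, as here, the set of values is small and known in advance.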