
Using constants with CUDA

Tags:

c

constants

cuda

nvidia

What is the best way of using constants in CUDA?

One way is to define constants in constant memory, like:

// CUDA global constants
__constant__ int M;

int main(void)
{
    ...
    cudaMemcpyToSymbol(M, &h_M, sizeof(M));  // h_M: a host-side int holding the value; pass the symbol itself, not the string "M"
    ...
}

An alternative way would be to use the C preprocessor:

#define M ...

I would think defining constants with the C preprocessor is much faster. What, then, are the benefits of using constant memory on a CUDA device?


asked Apr 20 '13 11:04

jrsm

2 Answers

1. Constants that are known at compile time should be defined using preprocessor macros (e.g. #define) or via C/C++ const variables at global/file scope.
2. Using __constant__ memory may be beneficial for programs that use certain values which don't change for the duration of the kernel and for which certain access patterns are present (e.g. all threads access the same value at the same time). This is not better or faster than constants that satisfy the requirements of item 1 above.
3. If the number of choices to be made by a program is relatively small, and these choices affect kernel execution, one possible approach for additional compile-time optimization would be to use templated code/kernels.
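
To illustrate the first and third points, here is a minimal sketch, not a definitive implementation. The names kScale, scale_fixed, scale_templated and run_scaled are hypothetical, and error checking is omitted for brevity. It assumes the set of possible factors is small enough to enumerate in a switch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Point 1: a value known at compile time (kScale is a hypothetical name).
// The compiler can fold it straight into the generated instructions.
const int kScale = 3;

__global__ void scale_fixed(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= kScale;
}

// Point 3: Factor is a compile-time constant inside each instantiation.
template <int Factor>
__global__ void scale_templated(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= Factor;
}

// Hypothetical host helper: maps a runtime choice onto a small set of
// instantiations, then returns element 1 of an array initialised to 0, 1, 2, ...
int run_scaled(int factor)
{
    const int n = 256;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    cudaMalloc((void **)&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    switch (factor) {
        case 2:  scale_templated<2><<<1, n>>>(d, n); break;
        case 4:  scale_templated<4><<<1, n>>>(d, n); break;
        default: scale_fixed<<<1, n>>>(d, n);        break;  // uses kScale
    }

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h[1];
}

int main()
{
    printf("%d %d %d\n", run_scaled(2), run_scaled(4), run_scaled(0));
    return 0;
}
```

The switch is the price of the template approach: every supported value needs its own instantiation, which is why it only suits a small number of choices.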

answered Sep 28 '22 07:09

Robert Crovella

Regular C/C++ style constants: In CUDA C/C++ (an extension of C++ rather than a modification of C99), ordinary constants are purely compile-time entities. This is hardly surprising, given how much optimization NVCC performs for GPU code.

#define: macros are, as always, inelegant but useful in a pinch.

The __constant__ variable specifier is, however, a completely new animal and something of a misnomer in my opinion. Here is what Nvidia's documentation says about it:

The __constant__ qualifier, optionally used together with __device__, declares a variable that:

Resides in constant memory space,

Has the lifetime of an application,

Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).

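Nvidia's documentation states that a read from __constant__ memory can be as fast as a register read, provided the value is already in the constant cache and all threads of a warp access the same address. When threads of a warp read different addresses, the accesses are serialized.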
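
As a rough sketch of that access pattern (DEGREE, coeff, eval_poly and run_poly are hypothetical names, and error checking is left out), the kernel below evaluates a polynomial with Horner's rule. At every loop step all threads read the same coeff[k], which is the uniform access that constant memory is designed for.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DEGREE 3  // hypothetical polynomial degree

// Coefficients live in constant memory; coeff[k] multiplies x^k.
__constant__ float coeff[DEGREE + 1];

__global__ void eval_poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    // Horner's rule: at each step every thread reads the same coeff[k].
    for (int k = DEGREE; k >= 0; --k)
        acc = acc * x[i] + coeff[k];
    y[i] = acc;
}

// Hypothetical host helper: evaluates 1 + 2x + x^3 at x0 on the device.
float run_poly(float x0)
{
    const float h_coeff[DEGREE + 1] = {1.0f, 2.0f, 0.0f, 1.0f};
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));

    const int n = 32;
    float hx[n], hy[n];
    for (int i = 0; i < n; ++i) hx[i] = x0;

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    eval_poly<<<1, n>>>(dx, dy, n);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return hy[0];
}

int main()
{
    printf("%f\n", run_poly(2.0f));
    return 0;
}
```

If the index into coeff depended on threadIdx.x instead of the loop counter, threads in a warp would hit different addresses and the reads would be serialized.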

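__constant__ variables are declared at global (file) scope in CUDA code. However, based on personal (and ongoing) experience, you have to be careful with this specifier when it comes to separate compilation, for example when you separate your CUDA code (.cu and .cuh files) from your C/C++ code by putting wrapper functions in C-style headers. Unless relocatable device code is enabled (nvcc -rdc=true), a __constant__ symbol is typically only usable from the translation unit that defines it.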

Unlike traditional "constant" variables, however, these are initialized at runtime from the host code that allocates device memory and ultimately launches the kernel. I am currently working on code that confirms they can be set at runtime using cudaMemcpyToSymbol() before kernel execution.

They are quite handy, to say the least, given the cache-level access speed you get whenever a read hits the constant cache.
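
A minimal sketch of setting a __constant__ variable at runtime, assuming everything lives in a single .cu file. M is the variable from the question; set_M, add_M and read_add_M are hypothetical names, and the extern "C" wrappers stand in for the C-style interface mentioned above. Error handling is reduced to a return code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ int M;  // same name as in the question

// Hypothetical C-style wrapper. A plain C header would only declare
//     int set_M(int value);
// while the definition stays in the same .cu file as M.
extern "C" int set_M(int value)
{
    // Pass the symbol itself, not a string with its name.
    return cudaMemcpyToSymbol(M, &value, sizeof(M)) == cudaSuccess ? 0 : -1;
}

__global__ void add_M(int *out)
{
    out[threadIdx.x] = threadIdx.x + M;  // every thread reads the same M
}

// Hypothetical wrapper: launches the kernel and returns out[index].
extern "C" int read_add_M(int index)
{
    const int n = 32;
    int h[n];
    int *d = nullptr;
    cudaMalloc((void **)&d, n * sizeof(int));
    add_M<<<1, n>>>(d);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h[index];
}

int main()
{
    set_M(10);                       // set before the first launch
    printf("%d\n", read_add_M(5));
    set_M(100);                      // change it between launches
    printf("%d\n", read_add_M(5));
    return 0;
}
```

The value is constant only from the kernel's point of view: the host can overwrite it between launches, which is why "constant" is something of a misnomer.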

answered Sep 28 '22 07:09

opetrenko
