 

When to use volatile with shared CUDA Memory

Under what circumstances should you use the volatile keyword with a CUDA kernel's shared memory? I understand that volatile tells the compiler never to cache any values, but my question is about the behavior with a shared array:

__shared__ float products[THREADS_PER_ACTION];

// some computation
products[threadIdx.x] = localSum;

// wait for everyone to finish their computation
__syncthreads();

// then a (basic, ugly) reduction:
if (threadIdx.x == 0) {
    float globalSum = 0.0f;
    for (int i = 0; i < THREADS_PER_ACTION; i++)
        globalSum += products[i];
}

Do I need products to be volatile in this case? Each array entry is only accessed by a single thread, except at the end, where everything is read by thread 0. Is it possible that the compiler could cache the entire array, so that I would need it to be volatile, or will it only cache individual elements?

Thanks!

asked Mar 11 '13 by Taj Morton



1 Answer

If you don't declare a shared array as volatile, then the compiler is free to optimize a location in shared memory by keeping its value in a register (whose scope is private to a single thread), for any thread, at its choosing. This is true whether you access that particular shared element from only one thread or not. Therefore, if you use shared memory as a communication vehicle between threads of a block, it's best to declare it volatile. However, this sort of communication pattern often also requires execution barriers to enforce ordering of reads/writes, so continue reading about barriers below.
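As an illustration (a hypothetical kernel, not from your code), the classic place volatile shared memory shows up is warp-synchronous code, where threads read values their neighbors just wrote without an intervening __syncthreads(). Here is a minimal sketch, assuming one warp of 32 threads and pre-Volta lockstep execution (on Volta and later, __syncwarp() calls between steps would also be required):

__global__ void warpSum(float *out, const float *in)
{
    // volatile forces every read/write of buf[] to go to shared
    // memory rather than a per-thread register copy
    volatile __shared__ float buf[32];
    int t = threadIdx.x;          // assumes blockDim.x == 32

    buf[t] = in[t];

    // each step reads a value another thread wrote in the previous
    // step; without volatile, the compiler could keep buf[t] in a
    // register and miss those updates
    if (t < 16) buf[t] += buf[t + 16];
    if (t <  8) buf[t] += buf[t +  8];
    if (t <  4) buf[t] += buf[t +  4];
    if (t <  2) buf[t] += buf[t +  2];
    if (t <  1) buf[t] += buf[t +  1];

    if (t == 0) *out = buf[0];
}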

Obviously, if each thread only ever accesses its own elements of shared memory, and never those associated with another thread, then this does not matter, and the compiler optimization will not break anything.

In your case, where each thread accesses its own elements of shared memory for a section of code, and the only inter-thread access occurs at a well-understood location, you could use a memory fence function to force the compiler to evict any values temporarily stored in registers back out to the shared array. You might think __threadfence_block() would be useful here, but in your case __syncthreads() already has memory-fencing functionality built in. So your __syncthreads() call is sufficient both to synchronize the threads and to force any register-cached values in shared memory back out to the shared array.
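To make that concrete, here is your fragment annotated (a sketch; the fence call is shown commented out only for contrast):

products[threadIdx.x] = localSum;  // compiler may have staged this in a register

// __threadfence_block();          // would make the store visible block-wide,
                                   // but does NOT wait for the other threads
__syncthreads();                   // barrier + fence: waits for every thread AND
                                   // makes all prior shared-memory stores visible

if (threadIdx.x == 0) {
    // all of products[] can now be read safely
}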

By the way, if that reduction at the end of your code is of performance concern, you could consider using a parallel reduction method to speed it up.
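For example, a minimal tree-reduction sketch, assuming THREADS_PER_ACTION is a power of two. It replaces the serial loop with log2(THREADS_PER_ACTION) steps, and needs no volatile because the __syncthreads() between steps already acts as the fence:

products[threadIdx.x] = localSum;
__syncthreads();

// halve the number of active threads each step
for (unsigned int s = THREADS_PER_ACTION / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
        products[threadIdx.x] += products[threadIdx.x + s];
    __syncthreads();   // executed by all threads, every iteration
}

if (threadIdx.x == 0) {
    float globalSum = products[0];   // block-wide sum
    // ... use globalSum ...
}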

answered Oct 26 '22 by Robert Crovella