CUDA: When to use shared memory and when to rely on L1 caching?

Q: Why is shared memory faster Cuda?

Summary. Shared memory is a powerful feature for writing well optimized CUDA code. Access to shared memory is much faster than global memory access because it is located on chip. Because shared memory is shared by threads in a thread block, it provides a mechanism for threads to cooperate.

Q: Is L1 cache faster than main memory?

CPUs often have a data cache, an instruction cache (for code), and a unified cache (for anything). Accessing these caches are much faster than accessing the RAM: Typically, the L1 cache is about 100 times faster than the RAM for data access, and the L2 cache is 25 times faster than RAM for data access.

Q: What are the advantages of L1 cache over L2 cache?

The different between L1 and L2 cacheL1 has a smaller memory capacity than L2. Also, L1 can be accessed faster than L2. L1 is usually in-built to the chip, while L2 is soldered on the motherboard very close to the chip. Therefore, L1 has a very little delay compared to L2.

Q: Is shared memory cached?

By default, all memory loads from global memory are cached in L1. The target location for the global memory load has no effect on the L1 caching (whether it is a register, or shared memory or thread local memory). The shared memory itself is obviously not cached.

Tags:

After Compute Capability 2.0 (Fermi) was released, I've wondered if there are any use cases left for shared memory. That is, when is it better to use shared memory than just let L1 perform its magic in the background?

Is shared memory simply there to let algorithms designed for CC < 2.0 run efficiently without modifications?

To collaborate via shared memory, threads in a block write to shared memory and synchronize with __syncthreads(). Why not simply write to global memory (through L1), and synchronize with __threadfence_block()? The latter option should be easier to implement since it doesn't have to relate to two different locations of values, and it should be faster because there is no explicit copying from global to shared memory. Since the data gets cached in L1, threads don't have to wait for data to actually make it all the way out to global memory.

With shared memory, one is guaranteed that a value that was put there remains there throughout the duration of the block. This is as opposed to values in L1, which get evicted if they are not used often enough. Are there any cases where it's better too cache such rarely used data in shared memory than to let the L1 manage them based on the usage pattern that the algorithm actually has?

801

asked Jun 30 '12 16:06

Roger Dahl

1 Answers

2 big reasons why automatic caching is less efficient than manual scratch pad memory (applies to CPUs as well)

parallel accesses to random addresses are more efficient. Example: histogramming. Let's say you want to increment N bins, and each are > 256 bytes apart. Then due to coalescing rules, that will result in N serial reads/writes since global and cache memory is organized in large ~256byte blocks. Shared memory doesn't have that problem.

Also to access global memory, you have to do virtual to physical address translation. Having a TLB that can do lots of translations in || will be quite expensive. I haven't seen any SIMD architecture that actually does vector loads/stores in || and I believe this is the reason why.

avoids writing back dead values to memory, which wastes bandwidth & power. Example: in an image processing pipeline, you don't want your intermediate images to get flushed to memory.

Also, according to an NVIDIA employee, current L1 caches are write-through (immediately writes to L2 cache), which will slow down your program.

So basically, the caches get in the way if you really want performance.

146

answered Oct 29 '22 05:10

Yale Zhang

Related questions
                            
                                TextBlock TextWrapping not wrapping inside StackPanel
                            
                                How to remove an app from the Play store? [closed]
                            
                                Pip freeze does not show repository paths for requirements file
                            
                                SynchronizationContext.Current is null in Continuation on the main UI thread
                            
                                How do I make a relative reference to another workbook in Excel?
                            
                                When does a Qt widget get a paintEvent?
                            
                                Google chat server archive (XEP-0136)
                            
                                WinRT apps and Regional settings. The correct way to format dates and numbers based on the user's regional settings?
                            
                                How to integrate Facebook in iOS 6 using SLRequest?
                            
                                PhoneGap plugin: fastest way to transfer JSON data to native
                            
                                On the std::abs function
                            
                                how to navigate to pasted stack trace visual-studio

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With