 

L1 cache persistence across CUDA kernels

I understand that shared memory on the GPU does not persist across different kernels. However, does the L1 cache persist across kernel calls?

asked Jul 02 '12 by gmemon


People also ask

Does a GPU have an L1 cache?

GPUs provide high-bandwidth/low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests. Specifically, concurrent memory requests accessing contiguous memory space are coalesced into warp-wide accesses.
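
A minimal sketch of what that looks like in practice (the kernel name stageAndSquare and the block size of 256 are assumptions for illustration, not from the post): consecutive threads read consecutive floats, so each warp's loads coalesce into wide transactions, and the block stages the data in on-chip shared memory before using it.

    __global__ void stageAndSquare(const float *in, float *out, int n)
    {
        __shared__ float tile[256];            // one element per thread; assumes blockDim.x == 256

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = in[i];         // contiguous indices -> coalesced global load
        }
        __syncthreads();                       // wait until the whole tile is loaded

        if (i < n) {
            float v = tile[threadIdx.x];       // low-latency on-chip read
            out[i] = v * v;
        }
    }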

What are CUDA kernels?

A CUDA kernel is a function that gets executed on the GPU. The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only once like a regular C/C++ function.
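
To make the K-times-in-parallel point concrete, here is a minimal sketch (the kernel name scale and the launch parameters are illustrative assumptions): the body runs once per thread, K times in total for a launch of K threads, rather than once as in a regular C/C++ function.

    __global__ void scale(float *data, float factor, int n)
    {
        // Each of the K threads computes its own global index
        // and processes exactly one element.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            data[i] *= factor;
        }
    }

    // Host-side launch: K = numBlocks * threadsPerBlock threads run the body above.
    // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);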

What is the function of the __global__ qualifier in a CUDA program?

__global__ is a qualifier added to standard C. It alerts the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).
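
For contrast, a minimal sketch of how __global__ relates to the other standard CUDA function qualifiers (the function names here are illustrative): __global__ marks an entry point launched from the host, while __device__ marks a helper callable only from device code.

    __device__ float square(float x)               // callable from device code only
    {
        return x * x;
    }

    __global__ void squareAll(float *data, int n)  // compiled for the GPU, launched from the host
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            data[i] = square(data[i]);
        }
    }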

What is memory hierarchy in CUDA?

The GPU has its own memory, separate from the CPU's. This means that data processed by the GPU must be moved from the CPU to the GPU before the computation starts, and the results of the computation must be moved back to the CPU once processing has completed.
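
A minimal sketch of that round trip using the CUDA runtime API (the wrapper name runOnGpu is an assumption, and it reuses the squareAll kernel sketched above): copy the input to the device, run the kernel, and copy the results back.

    #include <cuda_runtime.h>

    void runOnGpu(float *host_data, int n)
    {
        float *d_data;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_data, bytes);                          // allocate device memory
        cudaMemcpy(d_data, host_data, bytes, cudaMemcpyHostToDevice); // CPU -> GPU before compute

        squareAll<<<(n + 255) / 256, 256>>>(d_data, n);               // compute on the GPU

        cudaMemcpy(host_data, d_data, bytes, cudaMemcpyDeviceToHost); // GPU -> CPU after compute
        cudaFree(d_data);
    }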


1 Answer

The SM L1 cache is invalidated between all operations on the same stream or the null stream, to guarantee coherence. But it doesn't really matter, because the L1 cache on GPUs is not designed to improve temporal locality within a given thread of execution. On a massively parallel processor, it is parallel spatial locality that matters: you want threads that execute near each other to access data that lie near each other in memory.

When a cached memory load is performed, it is done for a single warp, and the cache stores cache line(s) that are accessed by threads in that warp (ideally only a single line). If the next warp accesses the same cache line(s), then the cache will hit and latency will be reduced. Otherwise, the cache will be updated with different cache lines. If memory accesses are very spread out, then later warps will probably evict cache lines from earlier warps before they get reused.
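
A hypothetical pair of kernels illustrating the two cases (the names and the stride trick are illustrative, not from the answer): in coalescedRead, a warp's 32 threads touch one or two 128-byte cache lines; in stridedRead with stride >= 32, each thread of a warp lands on a different line, so a single warp occupies 32 lines and later warps tend to evict what earlier warps loaded.

    __global__ void coalescedRead(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                 // neighbouring threads read neighbouring data
    }

    __global__ void stridedRead(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(i * stride) % n];  // accesses spread out; sketch ignores overflow
    }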

By the time another kernel runs, the data in the cache are unlikely to still be resident anyway, because the SM will probably have executed many warps from the previous kernel and evicted them, so whether the L1 cache persists across kernel launches doesn't really matter.

answered Sep 30 '22 by harrism