 

L1 cache persistence across CUDA kernels

I understand that shared memory on the GPU does not persist across different kernels. However, does the L1 cache persist across kernel calls?

asked Jul 02 '12 by gmemon


People also ask

Does a GPU have an L1 cache?

GPUs provide high-bandwidth/low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests. Specifically, concurrent memory requests accessing contiguous memory space are coalesced into warp-wide accesses.
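
A minimal sketch of what that looks like in practice (the kernel name stageAndSquare and the block size of 256 are assumptions for illustration, not from the post): consecutive threads read consecutive floats, so each warp's loads coalesce into wide transactions, and the block stages the data in on-chip shared memory before using it.

    __global__ void stageAndSquare(const float *in, float *out, int n)
    {
        __shared__ float tile[256];            // one element per thread; assumes blockDim.x == 256

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = in[i];         // contiguous indices -> coalesced global load
        }
        __syncthreads();                       // wait until the whole tile is loaded

        if (i < n) {
            float v = tile[threadIdx.x];       // low-latency on-chip read
            out[i] = v * v;
        }
    }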

What are CUDA kernels?

A CUDA kernel is a function that gets executed on the GPU. The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only once like a regular C/C++ function.
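
To make the K-times-in-parallel point concrete, here is a minimal sketch (the kernel name scale and the launch parameters are illustrative assumptions): the body runs once per thread, K times in total for a launch of K threads, rather than once as in a regular C/C++ function.

    __global__ void scale(float *data, float factor, int n)
    {
        // Each of the K threads computes its own global index
        // and processes exactly one element.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            data[i] *= factor;
        }
    }

    // Host-side launch: K = numBlocks * threadsPerBlock threads run the body above.
    // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);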

What is the function of the __global__ qualifier in a CUDA program?

__global__ is a qualifier added to standard C. It alerts the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).
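
For contrast, a minimal sketch of how __global__ relates to the other standard CUDA function qualifiers (the function names here are illustrative): __global__ marks an entry point launched from the host, while __device__ marks a helper callable only from device code.

    __device__ float square(float x)               // callable from device code only
    {
        return x * x;
    }

    __global__ void squareAll(float *data, int n)  // compiled for the GPU, launched from the host
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            data[i] = square(data[i]);
        }
    }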

What is memory hierarchy in CUDA?

The GPU has its own memory, separate from the CPU's. This means that data processed by the GPU must be moved from the CPU to the GPU before the computation starts, and the results of the computation must be moved back to the CPU once processing has completed.
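
A minimal sketch of that round trip using the CUDA runtime API (the wrapper name runOnGpu is an assumption, and it reuses the squareAll kernel sketched above): copy the input to the device, run the kernel, and copy the results back.

    #include <cuda_runtime.h>

    void runOnGpu(float *host_data, int n)
    {
        float *d_data;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_data, bytes);                          // allocate device memory
        cudaMemcpy(d_data, host_data, bytes, cudaMemcpyHostToDevice); // CPU -> GPU before compute

        squareAll<<<(n + 255) / 256, 256>>>(d_data, n);               // compute on the GPU

        cudaMemcpy(host_data, d_data, bytes, cudaMemcpyDeviceToHost); // GPU -> CPU after compute
        cudaFree(d_data);
    }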


1 Answer

The SM L1 cache is invalidated between all operations on the same stream or the null stream, to guarantee coherence. But it doesn't really matter, because the L1 cache on GPUs is not designed to improve temporal locality within a given thread of execution. On a massively parallel processor, it is parallel spatial locality that matters: you want threads that execute near each other to access data that lie near each other in memory.

When a cached memory load is performed, it is done for a single warp, and the cache stores cache line(s) that are accessed by threads in that warp (ideally only a single line). If the next warp accesses the same cache line(s), then the cache will hit and latency will be reduced. Otherwise, the cache will be updated with different cache lines. If memory accesses are very spread out, then later warps will probably evict cache lines from earlier warps before they get reused.
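
A hypothetical pair of kernels illustrating the two cases (the names and the stride trick are illustrative, not from the answer): in coalescedRead, a warp's 32 threads touch one or two 128-byte cache lines; in stridedRead with stride >= 32, each thread of a warp lands on a different line, so a single warp occupies 32 lines and later warps tend to evict what earlier warps loaded.

    __global__ void coalescedRead(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                 // neighbouring threads read neighbouring data
    }

    __global__ void stridedRead(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(i * stride) % n];  // accesses spread out; sketch ignores overflow
    }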

By the time another kernel runs, the data in the cache are unlikely to still be resident anyway, because the SM will probably have executed many warps from the previous kernel and evicted them, so whether the L1 cache persists across kernel launches doesn't really matter.

answered Sep 30 '22 by harrism