
CUDA programming - L1 and L2 caches

Could you please explain the difference between using both the L1 and L2 caches and using only the L2 cache in CUDA programming? What should I expect in terms of execution time? When could I expect a smaller GPU time: when I enable both L1 and L2, or when I enable only L2? Thanks.

Saman I asked Apr 16 '12 20:04



1 Answer

Typically you would leave both the L1 and L2 caches enabled. You should try to coalesce your memory accesses as much as possible, i.e. threads within a warp should access data within the same 128-byte segment wherever possible (see the CUDA Programming Guide for more on this topic).
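To make the coalescing point concrete, here is a rough host-side sketch in plain C++ (not CUDA, so it runs without a GPU). It counts how many aligned 128-byte segments a 32-thread warp touches for a given access stride. The helper names `segments_touched` and `warp_addresses` are made up for illustration, and the model assumes 4-byte elements and a 4-byte-aligned base address; real hardware behaviour depends on the architecture.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: number of distinct aligned segments of
// `segment_bytes` that contain at least one of the given byte addresses.
std::size_t segments_touched(const std::vector<std::uint64_t>& addrs,
                             std::uint64_t segment_bytes) {
    assert(segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addrs) {
        segments.insert(a / segment_bytes);  // index of the aligned segment
    }
    return segments.size();
}

// Hypothetical helper: byte addresses read by a 32-thread warp in which
// lane i reads the 4-byte element at index i * stride_elems.
// Assumption: `base` is 4-byte aligned, so no element straddles a segment.
std::vector<std::uint64_t> warp_addresses(std::uint64_t base,
                                          std::uint64_t stride_elems) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        addrs.push_back(base + lane * stride_elems * 4);
    }
    return addrs;
}
```

With a stride of 1 the whole warp falls inside one 128-byte segment, which is the coalesced case; with a stride of 32 elements every lane lands in its own segment, so the same request costs 32 segments in this model.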

Some programs cannot be optimised in this manner, for example because their memory accesses are completely random. In those cases it may be beneficial to bypass the L1 cache, which avoids loading an entire 128-byte line when you only want, say, 4 bytes (you will still load 32 bytes, since that is the minimum transaction size). The efficiency gain is clear: 4 useful bytes out of 128 improves to 4 out of 32.
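The arithmetic can be checked with a small host-side sketch in plain C++ (again not CUDA, and only a model). The function `load_efficiency` is a made-up name; it assumes each thread wants `bytes_per_thread` bytes and that every distinct aligned block of `transaction_bytes` touched by the warp is fetched in full.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: fraction of fetched bytes that the warp actually uses.
// Assumption: one full transaction is fetched per distinct aligned block
// of `transaction_bytes` that any thread's address falls into.
double load_efficiency(const std::vector<std::uint64_t>& addrs,
                       std::uint64_t bytes_per_thread,
                       std::uint64_t transaction_bytes) {
    assert(transaction_bytes > 0 && !addrs.empty());
    std::set<std::uint64_t> transactions;
    for (std::uint64_t a : addrs) {
        transactions.insert(a / transaction_bytes);
    }
    const double useful =
        static_cast<double>(addrs.size() * bytes_per_thread);
    const double fetched =
        static_cast<double>(transactions.size() * transaction_bytes);
    return useful / fetched;
}
```

For 32 threads that each read 4 bytes from addresses far apart (say 4096 bytes between neighbours), this gives 4/128 with 128-byte transactions and 4/32 with 32-byte transactions, matching the figures above. On the Fermi-era toolchains this answer dates from, the L1 bypass for global loads was, as far as I know, requested by compiling with `-Xptxas -dlcm=cg`; check the documentation for your architecture before relying on it.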

Tom answered Sep 21 '22 06:09