 

GPU L1 and L2 cache statistics

I have written some simple benchmarks that perform a series of global memory accesses. When I measure the L1 and L2 cache statistics, I've found that (on a GTX 580, which has 16 SMs):

 total L1 cache misses * 16 != total L2 cache queries

In fact, the right-hand side is much higher than the left-hand side (around five times higher). I've heard that register spills can also end up going through L2, but my kernel uses fewer than 28 registers, which is not that many. What could be the source of this difference? Or am I misinterpreting the meaning of these performance counters?
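For reference, the benchmarks are essentially strided global reads along these lines (a simplified sketch, not my exact code; the array size, stride, and launch configuration are placeholders):

    #include <cuda_runtime.h>

    __global__ void strided_read(const float *in, float *out, int n, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // Strided index so consecutive threads touch different 128-byte lines
        // and most loads miss in L1.
        int idx = (tid * stride) % n;
        out[tid] = in[idx];
    }

    int main()
    {
        const int n = 1 << 24;                  // placeholder array size
        const int threads = 256, blocks = 1024;
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, blocks * threads * sizeof(float));

        strided_read<<<blocks, threads>>>(in, out, n, 32);
        cudaDeviceSynchronize();

        cudaFree(in);
        cudaFree(out);
        return 0;
    }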

Thanks

asked Sep 19 '11 by Zk1001

2 Answers

From the CUDA Programming Guide, section G.4.2:

Global memory accesses are cached. Using the -dlcm compilation flag, they can be configured at compile time to be cached in both L1 and L2 (-Xptxas -dlcm=ca) (this is the default setting) or in L2 only (-Xptxas -dlcm=cg). A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.
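So one way to see the effect is to build the same kernel in both caching modes and compare the counters (the file name here is just a placeholder):

    nvcc -Xptxas -dlcm=ca bench.cu -o bench_l1l2    # cache global loads in L1 and L2 (default)
    nvcc -Xptxas -dlcm=cg bench.cu -o bench_l2only  # cache global loads in L2 only (32-byte transactions)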

answered by Gaszton


It could be due to the fact that reads from L1 are 128 bytes long, while reads from L2 are 32 bytes long.
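If that is the case, a rough back-of-the-envelope check (assuming the L2 query counters count 32-byte sector requests, which I believe is how the Fermi profiler counters work):

    1 L1 miss = 1 request for a 128-byte line from L2
              = 4 x 32-byte L2 sector queries
    => total L2 queries ~= 4 * (total L1 misses summed over the 16 SMs)

Any traffic that bypasses L1 entirely (for example global stores, or loads compiled with -dlcm=cg) would add further L2 queries on top of that without ever showing up as L1 misses.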

answered by Ravi