 

GPU L1 and L2 cache statistics

I have written some simple benchmarks that perform a series of global memory accesses. When I measure the L1 and L2 cache statistics, I've found that (on a GTX 580, which has 16 SMs):

 total L1 cache misses * 16 != total L2 cache queries

In fact, the right-hand side is much higher than the left-hand side (around five times higher). I've heard that register spills can also end up going through L2, but my kernel uses fewer than 28 registers, which is not that many. What could be the source of this difference? Or am I misinterpreting the meaning of these performance counters?
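For reference, the benchmarks are essentially strided global reads along these lines (a simplified sketch, not my exact code; the array size, stride, and launch configuration are placeholders):

    #include <cuda_runtime.h>

    __global__ void strided_read(const float *in, float *out, int n, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // Strided index so consecutive threads touch different 128-byte lines
        // and most loads miss in L1.
        int idx = (tid * stride) % n;
        out[tid] = in[idx];
    }

    int main()
    {
        const int n = 1 << 24;                  // placeholder array size
        const int threads = 256, blocks = 1024;
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, blocks * threads * sizeof(float));

        strided_read<<<blocks, threads>>>(in, out, n, 32);
        cudaDeviceSynchronize();

        cudaFree(in);
        cudaFree(out);
        return 0;
    }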

Thanks

asked Sep 19 '11 by Zk1001

2 Answers

From the CUDA Programming Guide, section G.4.2:

Global memory accesses are cached. Using the -dlcm compilation flag, they can be configured at compile time to be cached in both L1 and L2 (-Xptxas -dlcm=ca) (this is the default setting) or in L2 only (-Xptxas -dlcm=cg). A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.
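So one way to see the effect is to build the same kernel in both caching modes and compare the counters (the file name here is just a placeholder):

    nvcc -Xptxas -dlcm=ca bench.cu -o bench_l1l2    # cache global loads in L1 and L2 (default)
    nvcc -Xptxas -dlcm=cg bench.cu -o bench_l2only  # cache global loads in L2 only (32-byte transactions)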

answered by Gaszton


It could be due to the fact that reads from L1 are 128 bytes long, while reads from L2 are 32 bytes long.
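If that is the case, a rough back-of-the-envelope check (assuming the L2 query counters count 32-byte sector requests, which I believe is how the Fermi profiler counters work):

    1 L1 miss = 1 request for a 128-byte line from L2
              = 4 x 32-byte L2 sector queries
    => total L2 queries ~= 4 * (total L1 misses summed over the 16 SMs)

Any traffic that bypasses L1 entirely (for example global stores, or loads compiled with -dlcm=cg) would add further L2 queries on top of that without ever showing up as L1 misses.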

answered by Ravi