Cycles/cost for L1 Cache hit vs. Register on x86?

I remember assuming in my architecture class that an L1 cache hit costs 1 cycle (i.e. identical to register access time), but is that actually true on modern x86 processors?

How many cycles does an L1 cache hit take? How does it compare to register access?

asked Apr 23 '12 by user541686



2 Answers

Here's a great article on the subject:

http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1

To answer your question: yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)

PS:

The specifics will vary, but this link has some good ballpark figures:

Approximate cost to access various caches and main memory?

Core i7 Xeon 5500 Series Data Source Latency (approximate):

    L1 CACHE hit                               ~4 cycles
    L2 CACHE hit                               ~10 cycles
    L3 CACHE hit, line unshared                ~40 cycles
    L3 CACHE hit, shared line in another core  ~65 cycles
    L3 CACHE hit, modified in another core     ~75 cycles
    remote L3 CACHE                            ~100-300 cycles
    Local DRAM                                 ~30 ns (~120 cycles)
    Remote DRAM                                ~100 ns

PPS:

These figures represent much older, slower CPUs, but the ratios basically hold:

http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2

    Level                     Access Time   Typical Size   Technology   Managed By
    -----                     -----------   ------------   ----------   ----------
    Registers                 1-3 ns        < 1 KB         Custom CMOS  Compiler
    Level 1 Cache (on-chip)   2-8 ns        8 KB - 128 KB  SRAM         Hardware
    Level 2 Cache (off-chip)  5-12 ns       0.5 MB - 8 MB  SRAM         Hardware
    Main Memory               10-60 ns      64 MB - 1 GB   DRAM         Operating System
    Hard Disk                 3M - 10M ns   20 - 100 GB    Magnetic     Operating System/User
answered Sep 27 '22 by paulsm4


Throughput and latency are different things; you can't just add up cycle costs. For throughput, see Load/stores per cycle for recent CPU architecture generations - 2 loads per clock throughput for most modern microarchitectures. And see How can cache be that fast? for microarchitectural details of load/store execution units, including the load / store buffers that limit how much memory-level parallelism they can track.

The rest of this answer will focus only on latency, which is relevant for workloads that involve pointer-chasing (like linked lists and trees), and for how much latency out-of-order exec needs to hide. (L3 cache misses are usually too long to fully hide.)
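To make the latency/throughput distinction concrete, here's a rough C sketch (my own illustration, not code from the linked Q&As; sizes and names are arbitrary). Summing an array issues independent loads the out-of-order core can overlap, while chasing a linked list serializes every load behind the one before it:

    /* Sketch: independent loads (throughput-bound) vs. a dependent
     * pointer chase (latency-bound).  Hypothetical example. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node { struct node *next; long pad[7]; };  /* 64 B = one cache line */

    int main(void) {
        enum { N = 1 << 16, NODES = 256 };  /* 256 * 64 B = 16 KB, fits in L1d */

        /* Throughput-bound: these loads don't depend on each other, so the
         * out-of-order core can keep ~2 loads per clock in flight. */
        long *arr = malloc(N * sizeof *arr);
        long sum = 0;
        for (long i = 0; i < N; i++) arr[i] = i;
        for (long i = 0; i < N; i++) sum += arr[i];

        /* Latency-bound: each load's address is the previous load's result,
         * so the chain runs at one load per load-use latency. */
        static struct node nodes[NODES];
        for (int i = 0; i < NODES; i++) nodes[i].next = &nodes[(i + 1) % NODES];
        struct node *p = &nodes[0];
        __asm__ volatile("" : "+r"(p));   /* GCC/Clang: hide p from the optimizer */
        for (long i = 0; i < 10 * N; i++) p = p->next;

        printf("%ld %p\n", sum, (void *)p);  /* keep both results live */
        free(arr);
        return 0;
    }

Compile with gcc -O2: even though both loops hit in L1d, the chase loop runs at roughly one node per load-use latency, while the sum loop sustains multiple element loads per cycle.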

Single-cycle cache latency used to be a thing on simple in-order pipelines at lower clock speeds (so each cycle was more nanoseconds), especially with simpler caches (smaller, not as associative, and with a smaller TLB for caches that weren't purely virtually addressed.) e.g. the classic 5-stage RISC pipeline like MIPS I assumes 1 cycle for memory access on a cache hit, with address calculation in EX and memory access in a single MEM pipeline stage, before WB.

Modern high-performance CPUs divide the pipeline up into more stages, allowing each cycle to be shorter. This lets simple instructions like add / or / and run really fast, still 1 cycle latency but at high clock speed.


For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.


Intel Haswell's L1 load-use latency is 4 cycles for pointer-chasing, which is typical of modern x86 CPUs, i.e. how fast mov eax, [eax] can run in a loop with a pointer that points to itself. (Or for a linked list that hits in cache, which is easy to microbenchmark with a closed loop.) See also Is there a penalty when base+offset is in a different page than the base? That 4-cycle-latency special case applies only if the pointer comes directly from another load; otherwise it's 5 cycles.
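Here's a minimal way to reproduce that number (my own sketch, assuming GCC/Clang on x86-64; __rdtsc counts reference cycles, which only match core cycles if turbo/frequency scaling is off, as on the fixed-frequency setup quoted below):

    /* Measure mov rax,[rax] load-use latency via a self-pointing cell. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>   /* __rdtsc */

    int main(void) {
        void **cell = malloc(sizeof *cell);
        *cell = cell;                     /* the cell points to itself */

        void *p = cell;
        __asm__ volatile("" : "+r"(p));   /* hide p's value so the loads stay */

        const long iters = 100000000;
        unsigned long long t0 = __rdtsc();
        for (long i = 0; i < iters; i++)
            p = *(void **)p;              /* the mov rax,[rax] dependency chain */
        unsigned long long t1 = __rdtsc();

        printf("~%.2f reference cycles per load (p=%p)\n",
               (double)(t1 - t0) / iters, p);
        free(cell);
        return 0;
    }

With turbo disabled you'd expect this to print ~4.00 on a Haswell-like core; with complex addressing like [rsi + rax*8] in the chain, ~5.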

Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.


Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, reading from the store buffer for store data that hasn't yet committed to L1d cache).

As harold commented, register access is 0 cycles; reading a register adds no latency on top of the operation that uses it. So, for example:

  • inc eax has 1-cycle latency (just the ALU operation).
  • add dword [mem], 1 has 6-cycle latency until a load from dword [mem] will be ready (ALU + store-forwarding); e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles (see the sketch after this list).
  • mov rax, [rsi] has 4-cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency).
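A rough way to see that 6-cycle-vs-1-cycle difference (again my own sketch, not from the answer; same TSC-vs-core-clock caveat as above):

    /* Loop counter in memory vs. in a register. */
    #include <stdio.h>
    #include <x86intrin.h>

    int main(void) {
        const long iters = 100000000;

        volatile long mem_ctr = 0;        /* forced to live in memory */
        unsigned long long t0 = __rdtsc();
        while (mem_ctr < iters)
            mem_ctr++;                    /* load + add + store, store-forwarded */
        unsigned long long t1 = __rdtsc();

        long reg_ctr = 0;
        unsigned long long t2 = __rdtsc();
        while (reg_ctr < iters) {
            reg_ctr++;                    /* just an ALU add, 1-cycle chain */
            __asm__ volatile("" : "+r"(reg_ctr));  /* defeat constant-folding */
        }
        unsigned long long t3 = __rdtsc();

        printf("counter in memory:   ~%.2f cycles/iter\n", (double)(t1 - t0) / iters);
        printf("counter in register: ~%.2f cycles/iter\n", (double)(t3 - t2) / iters);
        return 0;
    }

The volatile counter forces a load + add + store every iteration, so the loop-carried chain is store-forwarding plus the add; the register version's chain is just the add.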

http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).

Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).

  • L1 Data cache = 32 KB, 64 B/line, 8-WAY.

  • L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.

  • L2 cache = 256 KB, 64 B/line, 8-WAY

  • L3 cache = 8 MB, 64 B/line

  • L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])

  • L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).

  • L2 Cache Latency = 12 cycles

  • L3 Cache Latency = 36 cycles

  • RAM Latency = 36 cycles + 57 ns

The top-level benchmark page is http://www.7-cpu.com/utils.html. It still doesn't really explain what the different test sizes mean, but the code is available. The test results include Skylake, which is nearly the same as Haswell in this test.
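The core of a test like 7-cpu's is just a pointer chase over randomly linked cache lines, swept across working-set sizes; cycles per load step up as the footprint falls out of L1d, L2, and L3. A sketch of that idea (my own code and size choices, GCC/Clang on x86-64 assumed):

    /* Sketch of a 7-cpu-style latency sweep. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>

    #define LINE 64                       /* bytes per cache line */

    static double chase(size_t bytes) {
        size_t n = bytes / LINE;
        char *buf = aligned_alloc(LINE, n * LINE);
        size_t *order = malloc(n * sizeof *order);

        /* Shuffle the line order so the hardware prefetcher can't follow. */
        for (size_t i = 0; i < n; i++) order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);   /* glibc rand() is 31-bit */
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        /* Link each line to the next in shuffled order, forming one cycle. */
        for (size_t i = 0; i < n; i++)
            *(void **)(buf + order[i] * LINE) = buf + order[(i + 1) % n] * LINE;

        void *p = buf + order[0] * LINE;
        const long iters = 20000000;
        unsigned long long t0 = __rdtsc();
        for (long i = 0; i < iters; i++)
            p = *(void **)p;                       /* dependent load chain */
        unsigned long long t1 = __rdtsc();

        __asm__ volatile("" :: "r"(p));            /* keep the chase alive */
        free(buf); free(order);
        return (double)(t1 - t0) / iters;
    }

    int main(void) {
        for (size_t kb = 16; kb <= 32 * 1024; kb *= 2)   /* 16 KB .. 32 MB */
            printf("%6zu KB: ~%.1f cycles/load\n", kb, chase(kb * 1024));
        return 0;
    }

On the Haswell above you'd expect plateaus near ~4, ~12, and ~36 cycles for footprints within 32 KB, 256 KB, and 8 MB respectively, then a climb toward DRAM latency.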

@paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.

answered Sep 27 '22 by Peter Cordes