 

Cache or Registers - which is faster?

I'm sorry if this is the wrong place to ask this, but I've searched and always found different answers. My question is:

Which is faster? Cache or CPU Registers?

As I understand it, the registers are what data is loaded into directly for execution, while the cache is just a storage place close to, or inside, the CPU.

Here are the sources I found that confuse me:

2 for cache | 1 for registers

http://in.answers.yahoo.com/question/index?qid=20110503030537AAzmDGp

Cache is faster.

http://wiki.answers.com/Q/Is_cache_memory_faster_than_CPU_registers

So which really is it?

asked Jan 24 '13 by user1255454



2 Answers

A CPU register is always faster than the L1 cache; it is the closest to the execution engine. The difference is roughly a factor of 3.

Trying to make this as intuitive as possible without getting lost in the physics underlying the question: there is a simple correlation between speed and distance in electronics. The further you make a signal travel, the harder it is to get that signal to the other end of the wire without it getting corrupted. It is the "there is no free lunch" principle of electronic design.

The corollary is that bigger is slower, because making something bigger inevitably makes the distances larger. For a while this was automatic: shrinking the feature size on the chip automatically produced a faster processor.

The register file in a processor is small and sits physically close to the execution engine. The furthest removed from the processor is the RAM; you can pop the case and actually see the wires between the two. In between sit the caches, designed to bridge the dramatic speed gap between those two extremes. Every processor has an L1 cache, relatively small (32 KB typically) and located closest to the core. Further out is the L2 cache, relatively big (4 MB typically) and located further from the core. More expensive processors also have an L3 cache, bigger and further away still.
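You can watch this distance penalty directly with a pointer-chasing microbenchmark. The following is a minimal sketch (my own illustration, assuming a POSIX system and a C compiler, not code from this answer): each load depends on the previous one, so the average time per load jumps as the working set outgrows the L1, then the L2/L3, then spills into RAM.

    /* Pointer-chase latency sketch: a shuffled circular list defeats the
       prefetcher, and each load depends on the previous one, so the
       average ns/load tracks the latency of whichever level of the
       hierarchy the working set fits in. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double chase_ns(size_t bytes, size_t steps)
    {
        size_t n = bytes / sizeof(void *);
        void **buf = malloc(n * sizeof(void *));
        size_t *idx = malloc(n * sizeof(size_t));

        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
            size_t j = rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++)            /* link into one big cycle */
            buf[idx[i]] = &buf[idx[(i + 1) % n]];

        void **p = buf;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < steps; i++)
            p = (void **)*p;                      /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        double per = ns / steps + (p == NULL);    /* use p so the loop isn't dead code */
        free(buf); free(idx);
        return per;
    }

    int main(void)
    {
        size_t sizes[] = { 16 << 10, 256 << 10, 8 << 20, 64 << 20 };  /* L1 .. RAM */
        for (int i = 0; i < 4; i++)
            printf("%6zu KB: %.2f ns/load\n", sizes[i] >> 10, chase_ns(sizes[i], 20000000));
        return 0;
    }

The absolute figures vary per machine, but the smallest size should land in L1 at a couple of nanoseconds per load while the largest spills into RAM at tens of nanoseconds; that staircase is exactly the bigger-is-slower effect described above.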

answered Oct 06 '22 by Hans Passant


Specifically on x86 architecture:

  • Reading from a register has 0 or 1 cycle latency.
  • Writing to a register has 0 cycle latency.
  • Reading/writing the L1 cache has a 3 to 5 cycle latency (the exact figure varies with the architecture's age).
  • Actual load/store requests may execute within 0 or 1 cycles due to the write-back buffer and store-forwarding features (details below; a rough timing sketch also follows this list).
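As a rough way to see the register-versus-L1 gap on your own machine, here is a hedged sketch (my own illustration; x86-64 with GCC or Clang and the __rdtsc() intrinsic assumed, and note that rdtsc counts reference cycles, which may differ from core cycles). It times a dependency chain that stays in a register against one that must store to and reload from memory every iteration.

    /* Register chain vs. store/reload chain, timed with rdtsc.
       The volatile variable forces a real store and reload on every
       iteration, so the second loop pays the memory round-trip latency
       instead of the ~1 cycle register-to-register add. */
    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>                      /* __rdtsc() */

    #define ITERS 100000000ULL

    int main(void)
    {
        volatile uint64_t slot = 0;             /* lives in memory, not a register */
        uint64_t x = 0;

        uint64_t t0 = __rdtsc();
        for (uint64_t i = 0; i < ITERS; i++) {
            x = x + 1;                          /* register-to-register add */
            __asm__ volatile("" : "+r"(x));     /* keep x in a register, defeat folding */
        }
        uint64_t t1 = __rdtsc();

        uint64_t t2 = __rdtsc();
        for (uint64_t i = 0; i < ITERS; i++)
            slot = slot + 1;                    /* reload + add + store every pass */
        uint64_t t3 = __rdtsc();

        printf("register chain: %.2f cycles/iter\n", (double)(t1 - t0) / ITERS);
        printf("memory   chain: %.2f cycles/iter\n", (double)(t3 - t2) / ITERS);
        return (int)(x & 1);                    /* use x so it isn't optimized away */
    }

Expect something near 1 cycle per iteration for the first loop and several cycles for the second; the exact gap depends on the write-back buffer and store-forwarding behavior covered below.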

Reading from a register can have a 1 cycle latency on Intel Core 2 CPUs (and earlier models) due to their design: if enough simultaneously-executing instructions are reading from different registers, the CPU's register bank will be unable to service all the requests in a single cycle. This design limitation isn't present in any x86 chip that's been put on the consumer market since 2010 (but it is present in some 2010/11-released Xeon chips).

L1 cache latencies are fixed per model but tend to be higher the older the model. However, keep in mind three things:

  1. x86 chips these days have a write-back buffer with a 0 cycle latency. When you store a value to memory it falls into that buffer, and the instruction is able to retire in a single cycle. Memory latency only becomes visible if you issue enough consecutive writes to fill the write-back buffer. Write-back buffers have been prominent in desktop chip design since about 2001, but were widely missing from the ARM-based mobile chip market until much more recently.

  2. x86 chips these days have store forwarding from the write-back buffer. If you store to an address and then read the same address back several instructions later, the CPU will fetch the value from the write-back buffer instead of accessing the L1 for it. This reduces the visible latency of what appears to be an L1 request to 1 cycle; in fact, the L1 isn't referenced at all in that case. Store forwarding also has some rules it must satisfy to work properly, and these vary a lot across the CPUs available on the market today (typically 128-bit address alignment and a matched operand size are required).

  3. The store forwarding feature can generate false positives, wherein the CPU thinks the address is in the write-back buffer based on a fast partial-bits check (usually 10-14 bits, depending on the chip). It uses an extra cycle to verify with a full check; if that fails, the CPU must re-route the operation as a regular memory request. This miss can add an extra 1-2 cycles of latency to qualifying L1 cache accesses. In my measurements, store-forwarding misses happen quite often on AMD's Bulldozer, for example; enough so that its L1 cache latency over time is about 10-15% higher than its documented 3 cycles. It is almost a non-factor on Intel's Core series. (A sketch of a forwarding failure follows this list.)
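To make the matched-operand-size rule concrete, here is a minimal sketch (my own illustration, not from any vendor documentation; x86-64 with GCC or Clang and the __rdtsc() intrinsic assumed). The first loop stores and reloads 64 bits at the same address, which forwards cheaply; the second writes the same location as two 32-bit halves and reloads 64 bits, a size mismatch that defeats forwarding on many chips and falls back to the slower path described above.

    /* Store-forwarding hit vs. miss, timed with rdtsc. Reloading with
       the same size and alignment as the store is forwarded cheaply;
       reloading 64 bits that were written as two 32-bit halves defeats
       forwarding on many chips and takes the slow path instead. */
    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>                       /* __rdtsc() */

    #define ITERS 100000000ULL

    int main(void)
    {
        volatile union { uint64_t q; uint32_t d[2]; } buf = { 0 };
        uint64_t v = 0;

        uint64_t t0 = __rdtsc();
        for (uint64_t i = 0; i < ITERS; i++) {
            buf.q = v;                           /* 64-bit store ...             */
            v = buf.q + 1;                       /* ... 64-bit reload: forwarded */
        }
        uint64_t t1 = __rdtsc();

        uint64_t t2 = __rdtsc();
        for (uint64_t i = 0; i < ITERS; i++) {
            buf.d[0] = (uint32_t)v;              /* two narrow stores ...        */
            buf.d[1] = (uint32_t)(v >> 32);
            v = buf.q + 1;                       /* ... one wide reload: forwarding fails */
        }
        uint64_t t3 = __rdtsc();

        printf("matched reload:    %.2f cycles/iter\n", (double)(t1 - t0) / ITERS);
        printf("mismatched reload: %.2f cycles/iter\n", (double)(t3 - t2) / ITERS);
        return (int)(v & 1);
    }

The absolute numbers vary per microarchitecture (Agner Fog's tables, linked below, document the per-model penalties), but the mismatched loop should be visibly slower per iteration.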

Primary reference: http://www.agner.org/optimize/ and specifically http://www.agner.org/optimize/microarchitecture.pdf

I then manually cross-referenced that information against the tables of architectures, models, and release dates on the various List of CPUs pages on Wikipedia.

answered Oct 06 '22 by jstine