I'm sorry if this is the wrong place to ask this, but I've searched and keep finding different answers. My question is: which is faster, the cache or the CPU registers? As I understand it, the registers are what the CPU loads data into directly in order to operate on it, while the cache is just a storage area close to or inside the CPU. Here are the sources I found that confuse me (2 for cache, 1 for registers): http://in.answers.yahoo.com/question/index?qid=20110503030537AAzmDGp Cache is faster. http://wiki.answers.com/Q/Is_cache_memory_faster_than_CPU_registers So which is it really?


Cache or Registers - which is faster?

2 Answers

A CPU register is always faster than the L1 cache. It sits closest to the execution units. The difference is roughly a factor of 3.

Trying to make this as intuitive as possible without getting lost in the physics underlying the question: there is a simple correlation between speed and distance in electronics. The further you make a signal travel, the harder it gets to get that signal to the other end of the wire without the signal getting corrupted. It is the "there is no free lunch" principle of electronic design.

The corollary is that bigger is slower: if you make something bigger, the distances inevitably get larger. For a while the reverse came for free, because shrinking the feature size on the chip automatically produced a faster processor.

The register file in a processor is small and sits physically close to the execution engine. The furthest removed from the processor is the RAM. You can pop the case and actually see the wires between the two. In between sit the caches, designed to bridge the dramatic gap between the speed of those two opposites. Every processor has an L1 cache, relatively small (32 KB typically) and located closest to the core. Further down is the L2 cache, relatively big (4 MB typically) and located further from the core. More expensive processors also have an L3 cache, bigger and further away.
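
As a rough illustration of "bigger is further away", the sketch below picks the closest level that can hold a given working set. The L1 and L2 sizes are the typical figures quoted above; the L3 size and the function name `closest_level_that_fits` are assumptions made up for this example, and real capacities vary by processor.

```python
KIB = 1024
MIB = 1024 * KIB

# Ordered from closest to the core to furthest away.
CACHE_LEVELS = [
    ("L1", 32 * KIB),   # typical size quoted in the answer
    ("L2", 4 * MIB),    # typical size quoted in the answer
    ("L3", 16 * MIB),   # assumed for illustration; not from the answer
]


def closest_level_that_fits(working_set_bytes):
    """Return the name of the closest level large enough for the working set."""
    for name, capacity in CACHE_LEVELS:
        if working_set_bytes <= capacity:
            return name
    return "RAM"  # nothing on the chip is big enough


if __name__ == "__main__":
    for size in (16 * KIB, 1 * MIB, 8 * MIB, 64 * MIB):
        print(size, "bytes ->", closest_level_that_fits(size))
```

This ignores associativity, sharing between cores, and everything else a real cache does; it only shows why a larger working set ends up being served from a slower, more distant level.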

174

answered Oct 06 '22 10:10

Hans Passant

Specifically on x86 architecture:

- Reading from a register has 0 or 1 cycle latency.
- Writing to a register has 0 cycle latency.
- Reading/writing L1 cache has a 3 to 5 cycle latency (varies by architecture age).
- Actual load/store requests may execute within 0 or 1 cycles due to write-back buffer and store-forwarding features (details below).
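
To put those cycle counts into wall-clock terms, here is a minimal sketch that converts cycles to nanoseconds. The 3 GHz clock is an assumption chosen only to make the arithmetic easy, and `cycles_to_ns` is a helper name invented for this example.

```python
ASSUMED_CLOCK_GHZ = 3.0  # hypothetical clock speed, not taken from the answer


def cycles_to_ns(cycles, clock_ghz):
    """One cycle lasts 1 / clock_ghz nanoseconds, so latency = cycles / GHz."""
    return cycles / clock_ghz


# Cycle counts quoted in the list above.
LATENCY_CYCLES = {
    "register read (worst case)": 1,
    "L1 access (fastest quoted)": 3,
    "L1 access (slowest quoted)": 5,
}

if __name__ == "__main__":
    for label, cycles in LATENCY_CYCLES.items():
        ns = cycles_to_ns(cycles, ASSUMED_CLOCK_GHZ)
        print(f"{label}: {cycles} cycles = {ns:.2f} ns")
```

At the assumed clock, a 3 cycle L1 access takes about 1 ns and a 1 cycle register read about a third of that.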

Reading from a register can have a 1 cycle latency on Intel Core 2 CPUs (and earlier models) due to their design: if enough simultaneously executing instructions are reading from different registers, the CPU's register bank is unable to service all the requests in a single cycle. This design limitation isn't present in any x86 chip that's been put on the consumer market since 2010 (but it is present in some Xeon chips released in 2010/11).

L1 cache latencies are fixed per-model but tend to get slower as you go back in time to older models. However, keep in mind three things:

1. x86 chips these days have a write-back cache that has a 0 cycle latency. When you store a value to memory it falls into that cache, and the instruction is able to retire in a single cycle. Memory latency then only becomes visible if you issue enough consecutive writes to fill the write-back cache. Write-back caches have been prominent in desktop chip design since about 2001, but were widely missing from the ARM-based mobile chip market until much more recently.
2. x86 chips these days have store forwarding from the write-back cache. If you store to an address, which places the value in the WB cache, and then read back the same address several instructions later, the CPU will fetch the value from the WB cache instead of accessing L1 memory for it. This reduces the visible latency on what appears to be an L1 request to 1 cycle, but in fact the L1 isn't referenced at all in that case. Store forwarding also has other rules that must be met for it to work, and these vary a lot across the CPUs available on the market today (typically requiring 128-bit address alignment and matched operand size).
3. The store forwarding feature can generate false positives, in which the CPU thinks the address is in the write-back buffer based on a fast partial-bits check (usually 10-14 bits, depending on chip). It uses an extra cycle to verify with a full check. If that fails, the CPU must re-route the access as a regular memory request. This miss can add an extra 1-2 cycles of latency to qualifying L1 cache accesses. In my measurements, store forwarding misses happen quite often on AMD's Bulldozer, for example; often enough that its L1 cache latency over time is about 10-15% higher than its documented 3 cycles. It is almost a non-factor on Intel's Core series.
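
The forwarding and false-positive behaviour described above can be sketched as a toy cost model. This is not how any real chip is implemented: the class `ToyWriteBackBuffer`, the 4 cycle L1 latency, the 12-bit partial check, and the 1 cycle miss penalty are all assumptions picked from within the ranges quoted above, purely for illustration.

```python
L1_LATENCY = 4        # cycles; assumed, within the 3 to 5 range quoted above
FORWARD_LATENCY = 1   # cycles; visible latency of a forwarded load, per the text
MISS_PENALTY = 1      # cycles; assumed, the text quotes 1 to 2
PARTIAL_BITS = 12     # assumed; the text quotes 10 to 14 depending on chip
PARTIAL_MASK = (1 << PARTIAL_BITS) - 1


class ToyWriteBackBuffer:
    """Hypothetical model: pending stores wait here before reaching L1."""

    def __init__(self):
        self.pending = {}  # address -> value; a newer store overwrites an older one
        self.l1 = {}       # stands in for the contents of L1

    def store(self, address, value):
        """Stores always land in the buffer first."""
        self.pending[address] = value

    def drain(self):
        """Commit every pending store to L1 and empty the buffer."""
        self.l1.update(self.pending)
        self.pending.clear()

    def load(self, address):
        """Return (value, cycles) for a load under the toy cost model."""
        low_bits = address & PARTIAL_MASK
        # Fast check: compare only the low address bits against pending stores.
        partial_hit = any((a & PARTIAL_MASK) == low_bits for a in self.pending)
        # Full check: the load is forwarded only if the whole address matches.
        if partial_hit and address in self.pending:
            return self.pending[address], FORWARD_LATENCY
        value = self.l1.get(address, 0)
        if partial_hit:
            # False positive: the partial check matched but the full one did not.
            return value, L1_LATENCY + MISS_PENALTY
        return value, L1_LATENCY
```

In this model, storing to `0x1000` and then loading `0x1000` is forwarded in 1 cycle; loading `0x2000` shares the same low 12 bits, so it trips the partial check and pays the penalty on top of the L1 latency; loading `0x2008` matches nothing and costs the plain L1 latency.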

Primary reference: http://www.agner.org/optimize/ and specifically http://www.agner.org/optimize/microarchitecture.pdf

I then manually graphed the information from those documents against the tables of architectures, models, and release dates from the various "List of CPUs" pages on Wikipedia.

answered Oct 06 '22 11:10

jstine


Tags:

performance

memory

caching

cpu

cpu-registers

user1255454
