 

Cache bandwidth per tick for modern CPUs

What is the speed of cache access for modern CPUs? How many bytes can be read from or written to memory per processor clock tick on an Intel P4, Core 2, Core i7, or AMD CPU?

Please answer with both theoretical numbers (width of the load/store units and their throughput in uops/tick) and practical numbers (even memcpy speed tests or the STREAM benchmark), if any.

PS: this question is about the maximum rate of load/store instructions in assembly. There is a theoretical load rate (every instruction issued per tick being the widest load), but a processor can sustain only a fraction of that, which is the practical load limit.
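
For the practical side, something like the minimal memcpy timing sketch below in C is what I have in mind (it is not the STREAM benchmark itself; the buffer size and iteration count are arbitrary choices):

/* Minimal memcpy bandwidth sketch (not the STREAM benchmark). Buffer size
   and iteration count are arbitrary; a careful run would pin the thread to
   one core and take the best of many repetitions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t size = 64 * 1024 * 1024;   /* 64 MiB, well past typical L2/L3 sizes */
    int iters = 20;
    char *src = malloc(size), *dst = malloc(size);
    if (!src || !dst) return 1;
    memset(src, 1, size);             /* touch the pages before timing */
    memset(dst, 0, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)iters * size * 2.0;   /* each pass reads AND writes size bytes */
    printf("%.2f GB/s\n", bytes / sec / 1e9);
    free(src); free(dst);
    return 0;
}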

asked Mar 01 '10 by osgx


2 Answers

For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf

Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache. 

128 bits = 16 bytes/clock read AND 128 bits = 16 bytes/clock write (can a read and a write be combined in the same cycle?)

The L2 and L3 caches each have a 256-bit port for reading or writing, 
but the L3 cache must share its port with three other cores on the chip.

Can the L2 and L3 ports be used for both a read and a write in the same clock?

Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.
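
A quick back-of-the-envelope conversion of those port widths into peak rates (the 3.2 GHz clock below is only an assumed example frequency, not a figure from the paper):

L1 read:  16 bytes/clock * 3.2e9 clocks/s ~= 51 GB/s per core
L1 write: 16 bytes/clock * 3.2e9 clocks/s ~= 51 GB/s per core
L2/L3:    32 bytes/clock * 3.2e9 clocks/s ~= 102 GB/s per port (read or write)

Real code rarely sustains these peaks: loads and stores compete with other uops for issue slots, and the L3 port is shared between cores.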

Latency (in clock ticks), some measured with CPU-Z's latency tool or with lmbench's lat_mem_rd; both use a long linked-list walk to correctly measure modern out-of-order cores like the Intel Core i7:

           L1     L2     L3, cycles;   mem             link
Core 2      3     15     --           66 ns           http://www.anandtech.com/show/2542/5
Core i7-xxx 4     11     39          40c+67ns         http://www.anandtech.com/show/2542/5
Itanium     1     5-6    12-17       130-1000 (cycles)
Itanium2    2     6-10   20          35c+160ns        http://www.7-cpu.com/cpu/Itanium2.html
AMD K8            12                 40-70c +64ns     http://www.anandtech.com/show/2139/3
Intel P4    2     19     43          200-210 (cycles) http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k 3     20                 180 (cycles)     --//--
AthlonFX-51 3     13                 125 (cycles)     --//--
POWER4      4     12-20  ??          hundreds cycles  --//--
Haswell     4     11-12  36          36c+57ns         http://www.realworldtech.com/haswell-cpu/5/    

A good source of latency data is the 7-cpu website, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html

More about the lat_mem_rd program is in its man page or here on SO.
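
For reference, the heart of such a latency measurement is just a chain of dependent loads. Here is a minimal pointer-chasing sketch in C (lat_mem_rd itself sweeps array sizes and strides rather than using one fixed random permutation, so treat this only as an illustration of the idea; the array size and step count are arbitrary):

/* Minimal pointer-chase latency sketch: each load depends on the previous
   one, so even an out-of-order core cannot overlap them - you measure
   latency, not bandwidth. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t n = (32 * 1024 * 1024) / sizeof(void *);  /* 32 MiB, past L3 on most parts */
    void **buf = malloc(n * sizeof(void *));
    if (!buf) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm) so the
       hardware prefetcher cannot guess the next cache line. */
    for (size_t i = 0; i < n; i++) buf[i] = &buf[i];
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        void *tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }

    size_t steps = 20 * 1000 * 1000;
    void **p = (void **)buf[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                 /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
    printf("%.1f ns per load (p=%p)\n", ns, (void *)p);  /* print p so the chase isn't optimized away */
    free(buf);
    return 0;
}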

answered Sep 21 '22 by osgx

The widest reads/writes are 128-bit (16-byte) SSE loads/stores. The L1/L2/L3 caches have different bandwidths and latencies, and these are of course CPU-specific. Typical L1 latency is 2-4 clocks on modern CPUs, but you can usually issue 1 or 2 load instructions per clock.
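
For illustration, a sketch of what those 16-byte SSE moves look like with intrinsics (copy_sse is a made-up name; it assumes 16-byte-aligned buffers and a size that is a multiple of 64 bytes, so it is not a complete memcpy):

/* 16-bytes-at-a-time SSE copy loop, just to show the widest (128-bit)
   loads/stores referred to above. A real memcpy handles unaligned heads
   and tails and may use non-temporal stores for large copies. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

void copy_sse(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i += 64) {   /* 4 x 16 bytes per iteration */
        __m128i a = _mm_load_si128((const __m128i *)(src + i));
        __m128i b = _mm_load_si128((const __m128i *)(src + i + 16));
        __m128i c = _mm_load_si128((const __m128i *)(src + i + 32));
        __m128i d = _mm_load_si128((const __m128i *)(src + i + 48));
        _mm_store_si128((__m128i *)(dst + i),      a);
        _mm_store_si128((__m128i *)(dst + i + 16), b);
        _mm_store_si128((__m128i *)(dst + i + 32), c);
        _mm_store_si128((__m128i *)(dst + i + 48), d);
    }
}

With one 128-bit load port and one 128-bit store port, a loop like this tops out at 16 bytes read plus 16 bytes written per clock, no matter how it is unrolled.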

I suspect there's a more specific question lurking here somewhere: what is it that you are actually trying to achieve? Do you just want to write the fastest possible memcpy?

answered Sep 23 '22 by Paul R