 

Cache bandwidth per tick for modern CPUs

What is the speed of cache access for modern CPUs? How many bytes can be read from or written to memory per processor clock tick on an Intel P4, Core 2, Core i7, or AMD CPU?

Please answer with both theoretical numbers (width of the load/store units and their throughput in uops/tick) and practical numbers (even memcpy speed tests or the STREAM benchmark), if any.

PS: this question is about the maximum rate of load/store instructions in assembly. There is a theoretical load rate (every instruction issued per tick being the widest load), but a processor can sustain only a fraction of that, which is the practical load limit.
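
For the practical side, something like the minimal memcpy timing sketch below in C is what I have in mind (it is not the STREAM benchmark itself; the buffer size and iteration count are arbitrary choices):

/* Minimal memcpy bandwidth sketch (not the STREAM benchmark). Buffer size
   and iteration count are arbitrary; a careful run would pin the thread to
   one core and take the best of many repetitions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t size = 64 * 1024 * 1024;   /* 64 MiB, well past typical L2/L3 sizes */
    int iters = 20;
    char *src = malloc(size), *dst = malloc(size);
    if (!src || !dst) return 1;
    memset(src, 1, size);             /* touch the pages before timing */
    memset(dst, 0, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)iters * size * 2.0;   /* each pass reads AND writes size bytes */
    printf("%.2f GB/s\n", bytes / sec / 1e9);
    free(src); free(dst);
    return 0;
}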

asked Mar 01 '10 by osgx


2 Answers

For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf

Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache. 

128 bits = 16 bytes/clock read AND 128 bits = 16 bytes/clock write (can a read and a write be combined in the same cycle?)

The L2 and L3 caches each have a 256-bit port for reading or writing, 
but the L3 cache must share its port with three other cores on the chip.

Can the L2 and L3 ports be used for both a read and a write in the same clock?

Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.
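
A quick back-of-the-envelope conversion of those port widths into peak rates (the 3.2 GHz clock below is only an assumed example frequency, not a figure from the paper):

L1 read:  16 bytes/clock * 3.2e9 clocks/s ~= 51 GB/s per core
L1 write: 16 bytes/clock * 3.2e9 clocks/s ~= 51 GB/s per core
L2/L3:    32 bytes/clock * 3.2e9 clocks/s ~= 102 GB/s per port (read or write)

Real code rarely sustains these peaks: loads and stores compete with other uops for issue slots, and the L3 port is shared between cores.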

Latency (in clock ticks), some measured with CPU-Z's latency tool or with lmbench's lat_mem_rd; both use a long linked-list walk to correctly measure modern out-of-order cores like the Intel Core i7:

           L1     L2     L3, cycles;   mem             link
Core 2      3     15     --           66 ns           http://www.anandtech.com/show/2542/5
Core i7-xxx 4     11     39          40c+67ns         http://www.anandtech.com/show/2542/5
Itanium     1     5-6    12-17       130-1000 (cycles)
Itanium2    2     6-10   20          35c+160ns        http://www.7-cpu.com/cpu/Itanium2.html
AMD K8            12                 40-70c +64ns     http://www.anandtech.com/show/2139/3
Intel P4    2     19     43          200-210 (cycles) http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k 3     20                 180 (cycles)     --//--
AthlonFX-51 3     13                 125 (cycles)     --//--
POWER4      4     12-20  ??          hundreds cycles  --//--
Haswell     4     11-12  36          36c+57ns         http://www.realworldtech.com/haswell-cpu/5/    

A good source of latency data is the 7-cpu website, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html

More about the lat_mem_rd program is in its man page or here on SO.
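
For reference, the heart of such a latency measurement is just a chain of dependent loads. Here is a minimal pointer-chasing sketch in C (lat_mem_rd itself sweeps array sizes and strides rather than using one fixed random permutation, so treat this only as an illustration of the idea; the array size and step count are arbitrary):

/* Minimal pointer-chase latency sketch: each load depends on the previous
   one, so even an out-of-order core cannot overlap them - you measure
   latency, not bandwidth. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t n = (32 * 1024 * 1024) / sizeof(void *);  /* 32 MiB, past L3 on most parts */
    void **buf = malloc(n * sizeof(void *));
    if (!buf) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm) so the
       hardware prefetcher cannot guess the next cache line. */
    for (size_t i = 0; i < n; i++) buf[i] = &buf[i];
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        void *tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }

    size_t steps = 20 * 1000 * 1000;
    void **p = (void **)buf[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                 /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
    printf("%.1f ns per load (p=%p)\n", ns, (void *)p);  /* print p so the chase isn't optimized away */
    free(buf);
    return 0;
}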

answered Sep 21 '22 by osgx

The widest reads/writes are 128-bit (16-byte) SSE loads/stores. The L1/L2/L3 caches have different bandwidths and latencies, and these are of course CPU-specific. Typical L1 latency is 2-4 clocks on modern CPUs, but you can usually issue 1 or 2 load instructions per clock.
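
For illustration, a sketch of what those 16-byte SSE moves look like with intrinsics (copy_sse is a made-up name; it assumes 16-byte-aligned buffers and a size that is a multiple of 64 bytes, so it is not a complete memcpy):

/* 16-bytes-at-a-time SSE copy loop, just to show the widest (128-bit)
   loads/stores referred to above. A real memcpy handles unaligned heads
   and tails and may use non-temporal stores for large copies. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

void copy_sse(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i += 64) {   /* 4 x 16 bytes per iteration */
        __m128i a = _mm_load_si128((const __m128i *)(src + i));
        __m128i b = _mm_load_si128((const __m128i *)(src + i + 16));
        __m128i c = _mm_load_si128((const __m128i *)(src + i + 32));
        __m128i d = _mm_load_si128((const __m128i *)(src + i + 48));
        _mm_store_si128((__m128i *)(dst + i),      a);
        _mm_store_si128((__m128i *)(dst + i + 16), b);
        _mm_store_si128((__m128i *)(dst + i + 32), c);
        _mm_store_si128((__m128i *)(dst + i + 48), d);
    }
}

With one 128-bit load port and one 128-bit store port, a loop like this tops out at 16 bytes read plus 16 bytes written per clock, no matter how it is unrolled.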

I suspect there's a more specific question lurking here somewhere: what is it that you are actually trying to achieve? Do you just want to write the fastest possible memcpy?

answered Sep 23 '22 by Paul R