Where is the L1 memory cache of Intel x86 processors documented?

It is near impossible to find specs on Intel caches. When I was teaching a class on caches last year, I asked friends inside Intel (in the compiler group) and they couldn't find specs.

But wait!!! Jed, bless his soul, tells us that on Linux systems, you can squeeze lots of information out of the kernel:

grep . /sys/devices/system/cpu/cpu0/cache/index*/*

This will give you associativity, set size, and a bunch of other information (but not latency). For example, I learned that although AMD advertises their 128K L1 cache, my AMD machine has a split I and D cache of 64K each.
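The same files can also be read from a program. What follows is only a minimal sketch, assuming the Linux layout `<base>/index<N>/<attribute>` that the grep command above walks, and assuming the `size` file holds a number with an optional K/M/G suffix (for example `32K`). The helper names `parse_cache_size` and `read_cache_attr` are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a sysfs-style size string such as "32K"
 * into bytes. Returns 0 if the string does not start with a number. */
unsigned long parse_cache_size(const char *text)
{
    char *end = NULL;
    unsigned long value = strtoul(text, &end, 10);
    if (end == text)
        return 0;
    switch (*end) {
    case 'K': return value << 10;
    case 'M': return value << 20;
    case 'G': return value << 30;
    default:  return value;          /* plain byte count */
    }
}

/* Hypothetical helper: read one attribute file, e.g.
 * <base_dir>/index0/size, into buf and strip the trailing newline.
 * Returns 0 on success, -1 if the file cannot be read. */
int read_cache_attr(const char *base_dir, int index, const char *attr,
                    char *buf, size_t buf_len)
{
    char path[512];
    snprintf(path, sizeof path, "%s/index%d/%s", base_dir, index, attr);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)buf_len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

Calling `read_cache_attr("/sys/devices/system/cpu/cpu0/cache", 0, "size", buf, sizeof buf)` and passing `buf` to `parse_cache_size` should give the size in bytes of whichever cache `index0` describes on that machine.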

Two suggestions which are now mostly obsolete thanks to Jed:

AMD publishes a lot more information about its caches, so you can at least get some information about a modern cache. For example, last year's AMD L1 caches delivered two words per cycle (peak).
The open-source tool valgrind has all sorts of cache models inside it, and it is invaluable for profiling and understanding cache behavior. It comes with a very nice visualization tool kcachegrind which is part of the KDE SDK.

For example: as of Q3 2008, AMD K8/K10 CPUs use 64 byte cache lines, with a split L1 cache of 64kB each for L1I and L1D. L1D is 2-way associative and exclusive with L2, with a latency of 3 cycles. L2 cache is 16-way associative and its latency is about 12 cycles.

AMD Bulldozer-family CPUs use a split L1 with a 16kiB 4-way associative L1D per cluster (2 per core).

Intel CPUs have kept L1 the same for a long time (from Pentium M to Haswell to Skylake, and presumably many generations after that): Split 32kB each I and D caches, with L1D being 8-way associative. 64 byte cache lines, matching the burst-transfer size of DDR DRAM. Load-use latency is ~4 cycles.
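Those figures pin down the geometry of the cache: 32 kB / (8 ways x 64 bytes per line) = 64 sets, so the low 6 address bits select the byte within a line and the next 6 bits select the set. A small sketch of that arithmetic, with made-up helper names (`cache_num_sets`, `cache_set_index`) and assuming a simple modulo-indexed cache:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of sets = total size / (ways * line size). */
size_t cache_num_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    return size_bytes / (ways * line_bytes);
}

/* Set that an address maps to: drop the offset-within-line bits,
 * then take the line number modulo the number of sets. */
size_t cache_set_index(uintptr_t addr, size_t line_bytes, size_t num_sets)
{
    return (size_t)((addr / line_bytes) % num_sets);
}
```

With these numbers, two addresses that are 4096 bytes apart land in the same set, so more than 8 such lines cannot be resident in L1D at once. The same function gives 512 sets for the 64kB 2-way K8/K10 L1D mentioned above.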

Also see the x86 tag wiki for links to more performance and microarchitectural data.

This Intel Manual: Intel® 64 and IA-32 Architectures Optimization Reference Manual has a decent discussion of cache considerations.

Page 46, Section 2.2.5.1 Intel® 64 and IA-32 Architectures Optimization Reference Manual

Even MicroSlop is waking up to the need for more tools to monitor cache usage and performance, and has a GetLogicalProcessorInformation() function example (...while blazing new trails in creating ridiculously long function names in the process) I think I'll code up.

UPDATE I: Haswell increases cache load performance 2X, from Inside the Tock; Haswell's Architecture

If there were any doubt how critical it is to make the best possible use of cache, this presentation by Cliff Click, formerly of Azul, should dispel any and all doubt. In his words, "memory is the new disk!".

Haswell’s URS (Unified Reservation Station)

UPDATE II: Skylake's significantly improved cache performance specifications.

Skylake Cache Specifications

You are looking at the consumer specifications, not the developer specifications. Here is the documentation you want. The cache sizes vary by processor family sub-models, so they typically are not in the IA-32 development manuals, but you can easily look them up on NewEgg and such.

Edit: More specifically: Chapter 10 of Volume 3A (Systems Programming Guide), Chapter 7 of the Optimization Reference Manual, and potentially something in the TLB page-caching manual, although I would assume that one is further out from the L1 than you care about.

I did some more investigating. There is a group at ETH Zurich who built a memory-performance evaluation tool which might be able to get information about the size at least (and maybe also associativity) of L1 and L2 caches. The program works by trying different read patterns experimentally and measuring the resulting throughput. A simplified version was used for the popular textbook by Bryant and O'Hallaron.
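The measuring idea can be sketched in a few lines. This is not the ETH Zurich tool's code, only an illustration of the general approach under stated assumptions: read an array of a given size with a given stride, time it, and watch the throughput drop as the working set outgrows each cache level. The function names are made up, and `clock()` measures CPU time with coarse resolution, so real measurements need large repeat counts.

```c
#include <stddef.h>
#include <time.h>

/* Sum every stride-th element (stride must be >= 1). The volatile
 * qualifier forces the compiler to actually perform each load. */
long strided_sum(const volatile long *data, size_t count, size_t stride)
{
    long sum = 0;
    for (size_t i = 0; i < count; i += stride)
        sum += data[i];
    return sum;
}

/* Read throughput in MB/s for one (working-set size, stride) pair.
 * Returns 0.0 if the run was too short for clock() to resolve. */
double read_throughput_mbps(const volatile long *data, size_t count,
                            size_t stride, int repeats)
{
    long sink = strided_sum(data, count, stride);   /* warm-up pass */

    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        sink += strided_sum(data, count, stride);
    clock_t stop = clock();
    (void)sink;

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;

    /* Elements touched per pass = ceil(count / stride). */
    double elements = (double)((count + stride - 1) / stride);
    return (double)repeats * elements * sizeof(long) / seconds / 1e6;
}
```

Sweeping `count` from a few kB up to many MB at a fixed stride, and then repeating for several strides, should show plateaus whose edges hint at the L1 and L2 sizes.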

L1 caches exist on these platforms. This will almost definitely remain true until memory and front-side bus speeds exceed the speed of the CPU, which is very likely a long way off.

On Windows, you can use the GetLogicalProcessorInformation function to get some level of cache information (size, line size, associativity, etc.). The Ex version on Win7 will give even more data, like which cores share which cache. CPU-Z also gives this information.

Locality of reference has a major impact on the performance of some algorithms; the size and speed of the L1, L2 (and on newer CPUs, L3) caches obviously play a large part in this. Matrix multiplication is one such algorithm.
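To make the matrix multiplication case concrete, here is a minimal sketch (function names are made up, matrices are assumed square and stored row-major) of the same product written with two loop orders. Both compute identical results; only the memory access pattern differs.

```c
#include <stddef.h>

/* C = A * B for n x n row-major matrices, textbook i-j-k order.
 * The inner loop walks B down a column, a stride of n doubles. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Same product with the loops reordered to i-k-j.
 * The inner loop walks both B and C along a row (unit stride). */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Once a row of B no longer fits in a few cache lines, the i-j-k version touches a new line on almost every inner iteration, while the i-k-j version uses every element of each line it loads. How large the difference is depends on the machine and on n.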

Intel Manual Vol. 2 specifies the following formula to compute cache size:

This Cache Size in Bytes

= (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)

= (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)

Where Ways, Partitions, Line_Size and Sets are queried using cpuid with eax set to 0x04 and ecx set to the index of the cache to describe.
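As a plain C sketch of that formula (function names are made up), the decoding can be separated from the cpuid call itself so it can be checked against known register values. The register values are assumed to come from cpuid leaf 0x04, for example via `__get_cpuid_count` from the GCC/Clang `<cpuid.h>` header. The type and level fields in EAX are decoded too, because the index passed in ecx is a subleaf number and not the cache level.

```c
#include <stdint.h>

/* Cache size in bytes from the EBX and ECX values returned by
 * CPUID leaf 0x04. Each field is stored as (value - 1). */
uint64_t cpuid4_cache_size(uint32_t ebx, uint32_t ecx)
{
    uint64_t ways       = ((ebx >> 22) & 0x3ff) + 1;  /* EBX[31:22] */
    uint64_t partitions = ((ebx >> 12) & 0x3ff) + 1;  /* EBX[21:12] */
    uint64_t line_size  = (ebx & 0xfff) + 1;          /* EBX[11:0]  */
    uint64_t sets       = (uint64_t)ecx + 1;          /* ECX[31:0]  */
    return ways * partitions * line_size * sets;
}

/* EAX[4:0]: cache type (0 = no more caches, 1 = data,
 * 2 = instruction, 3 = unified). */
unsigned cpuid4_cache_type(uint32_t eax)
{
    return eax & 0x1f;
}

/* EAX[7:5]: cache level (1, 2, 3, ...). */
unsigned cpuid4_cache_level(uint32_t eax)
{
    return (eax >> 5) & 0x7;
}
```

A caller would typically loop over subleaf indices 0, 1, 2, ... until the type field reads 0, and use the level and type fields to tell L1 data from L1 instruction, L2 and L3.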

Providing the header file declaration

x86_cache_size.h:

unsigned int get_cache_line_size(unsigned int cache_level);

The implementation looks as follows:

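;returns the total size in bytes of one cache (despite the name, not the line size)
;1st argument (edi) - the CPUID leaf 0x04 subleaf index, which is not the cache level:
;typically subleaf 0 is L1 data, 1 is L1 instruction, 2 is L2, 3 is L3
;exported so the C code can link against it (NASM syntax)
global get_cache_line_size
section .text
get_cache_line_size:
    push rbx
    ;subleaf index goes in ecx for the CPUID instruction
    mov ecx, edi
    ;select CPUID leaf 0x04
    mov eax, 0x04
    cpuid

    ;cache line size: EBX[11:0] + 1
    mov eax, ebx
    and eax, 0xfff
    inc eax

    ;partitions: EBX[21:12] + 1
    shr ebx, 12
    mov edx, ebx
    and edx, 0x3ff
    inc edx
    mul edx

    ;ways of associativity: EBX[31:22] + 1
    shr ebx, 10
    mov edx, ebx
    and edx, 0x3ff
    inc edx
    mul edx

    ;number of sets: ECX + 1
    inc ecx
    mul ecx

    pop rbx

    ret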

Which on my machine works as follows:

#include <stdio.h>
#include "x86_cache_size.h"

int main(void){
    //the argument is the CPUID leaf 0x04 subleaf index
    unsigned int L1_cache_size = get_cache_line_size(1);
    unsigned int L2_cache_size = get_cache_line_size(2);
    unsigned int L3_cache_size = get_cache_line_size(3);
    //L1 size = 32768, L2 size = 262144, L3 size = 8388608
    printf("L1 size = %u, L2 size = %u, L3 size = %u\n", L1_cache_size, L2_cache_size, L3_cache_size);
    return 0;
}

Tags: performance, cpu-architecture, cpu-cache, intel
