 

How do Intel Xeon CPUs write to memory?

I'm trying to decide between two algorithms. One writes 8 bytes (two aligned 4-byte words) to 2 cache lines, the other writes 3 entire cache lines.

If the CPU writes only the changed 8 bytes back to memory, then the first algorithm uses much less memory bandwidth: 8 bytes vs 192 bytes. If the CPU writes entire cache lines, then the difference between 128 and 192 bytes is less striking.
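To make the comparison concrete, here is a minimal sketch of the two access patterns (the struct, values, and function names are invented for illustration):

    #include <cstdint>
    #include <cstring>

    struct CacheLine { uint8_t bytes[64]; };

    // Algorithm A: two aligned 4-byte stores into two different lines
    // (8 bytes of payload, 2 lines dirtied).
    void algo_a(CacheLine *lines) {
        uint32_t v0 = 0x11111111, v1 = 0x22222222;
        std::memcpy(lines[0].bytes, &v0, sizeof v0);  // 4B into line 0
        std::memcpy(lines[1].bytes, &v1, sizeof v1);  // 4B into line 1
    }

    // Algorithm B: overwrite three whole lines (192 bytes of payload).
    void algo_b(CacheLine *lines) {
        std::memset(lines, 0, 3 * sizeof(CacheLine));
    }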

So how does an Intel Xeon CPU write back to memory? You'd be surprised how hard it is to find an answer on Google to something that should be well known.

As I understand it, the writes go into the store buffer, and then to the cache. They might only get written to memory when the dirty cache line is evicted from the cache, but does Intel track which parts of the cache line are dirty, or just dump the entire thing? I rather doubt that they track things below cache line granularity. I would also be very surprised if anything goes to memory before the cache line is evicted.

Asked Jul 25 '15 by Eloff



1 Answer

Locality matters even for DRAM itself, even discounting caching. A burst write of 64B contiguous bytes for a dirty cache-line is a lot faster than 16 writes of 4B to 16 different addresses. Or to put it another way, writing back an entire cache line is not much slower than writing back just a few changed bytes in a cache line.
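As a rough illustration (not a rigorous benchmark; the buffer size and the timing method are simplified assumptions), a loop that stores one 4B word per 64B line still dirties every line it touches, so its write-back traffic ends up close to that of the fully contiguous version even though it writes 1/16 of the payload:

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        // 64 MiB of uint32_t: large enough to spill out of any cache,
        // so the stores actually generate DRAM write-back traffic.
        constexpr std::size_t N = (64u << 20) / sizeof(uint32_t);
        std::vector<uint32_t> buf(N, 0);

        auto now = [] { return std::chrono::steady_clock::now(); };
        auto ms = [](auto a, auto b) {
            return std::chrono::duration<double, std::milli>(b - a).count();
        };

        // Contiguous: write all 64 bytes of every line.
        auto t0 = now();
        for (std::size_t i = 0; i < N; i++) buf[i] = 1;
        auto t1 = now();

        // Sparse: one 4B store per 64B line (stride of 16 uint32_t).
        // Only 1/16 the payload, but every line is still dirtied and
        // eventually written back in full.
        for (std::size_t i = 0; i < N; i += 16) buf[i] = 2;
        auto t2 = now();

        std::printf("full lines: %.1f ms, 4B per line: %.1f ms\n",
                    ms(t0, t1), ms(t1, t2));
    }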

What Every Programmer Should Know About Memory, by Ulrich Drepper, explains a lot of stuff about avoiding memory bottlenecks when programming. He includes some details of DRAM addressing. DRAM controllers have to select a row, and then select a column. Accessing another virtual memory page can also cause a TLB miss.

DRAM does have a burst-transfer command for transferring a sequential chunk of data. (Obviously designed for the benefit of CPUs writing back cache lines). The memory system in modern computers is optimized for the usage-pattern of writing whole cache lines, because that's what almost always happens.

Cache lines are the unit at which CPUs track dirty-or-not. It would be possible to track dirtiness at a finer granularity than the present-or-not cache line, but that would take extra transistors and isn't worth it. The multiple levels of cache are set up to transfer whole cache lines around, so they can be as fast as possible when a whole cache line needs to be read.
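A toy model of that bookkeeping (invented types, not real hardware): one dirty bit covers the whole 64B line, so even a 1-byte store commits you to a full-line write-back on eviction.

    #include <cstdint>

    // Toy model: one valid bit and one dirty bit per 64-byte line.
    // Any store marks the whole line dirty; eviction then writes back
    // all 64 bytes in a single burst.
    struct CacheLineEntry {
        uint64_t tag;       // which 64B block of memory this line holds
        bool     valid = false;
        bool     dirty = false;
        uint8_t  data[64];
    };

    void store_byte(CacheLineEntry &line, unsigned offset, uint8_t value) {
        line.data[offset % 64] = value;
        line.dirty = true;  // even a 1-byte store dirties the entire line
    }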

There are so-called non-temporal reads/writes (movnti/movntdqa) that bypass the cache. These are for use with data that won't be touched again until it would have been evicted from the cache anyway (hence the non-temporal). They are a bad idea for data that could benefit from caching, but would let you write 4 bytes to memory rather than a whole cache line. Depending on the MTRR for that memory range, the write might or might not be subject to write-combining. (This is relevant for memory-mapped I/O regions, where two adjacent 4B writes aren't the same as one 8B write.)
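For example, movnti is exposed in C/C++ as the SSE2 intrinsic _mm_stream_si32. A minimal sketch (the function and parameter names are placeholders):

    #include <immintrin.h>  // _mm_stream_si32 (movnti), _mm_sfence

    // Store one 4-byte value without allocating the destination line in
    // the cache. `dst` and `value` are placeholder names.
    void nt_store_4bytes(int *dst, int value) {
        _mm_stream_si32(dst, value);  // compiles to movnti
        _mm_sfence();  // order the NT store before later ordinary stores
    }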

The algorithm that only touches two cache lines certainly has the advantage on that score, unless it takes a lot more computation, or especially branching, to figure out which memory to write. Maybe ask a different question if you want help deciding. (See the links at https://stackoverflow.com/tags/x86/info, especially Agner Fog's guides, for info that will help you decide for yourself.)

See Cornstalks' answer for warnings about the dangers of having multiple threads on different CPUs touching the same memory. That can cause bigger slowdowns than the extra write traffic costs a single-threaded program.
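The usual way this bites is false sharing. A minimal sketch (the struct names and iteration count are invented) of two threads hammering counters that do, or don't, share a 64B line:

    #include <atomic>
    #include <thread>

    // Both counters land in the same 64B line, so each increment on one
    // core invalidates the other core's copy of the line (false sharing).
    struct SharedLine {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    // Padding each counter to its own 64B line removes the ping-pong.
    struct SeparateLines {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <class Counters>
    void hammer(Counters &c) {
        std::thread t1([&] { for (int i = 0; i < 10'000'000; i++) c.a++; });
        std::thread t2([&] { for (int i = 0; i < 10'000'000; i++) c.b++; });
        t1.join();
        t2.join();  // SeparateLines typically runs several times faster
    }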

Answered Sep 24 '22 by Peter Cordes