
What is the benefit of the MOESI cache coherency protocol over MESI?

I was wondering what benefits MOESI has over the MESI cache coherency protocol, and which protocol is currently favored for modern architectures. Oftentimes benefits don't translate to implementation if the costs don't allow it. Quantitative performance results of MOESI over MESI would be nice to see also.

Nathan Doromal asked Apr 23 '18


2 Answers

AMD uses MOESI, Intel uses MESIF. (I don't know about non-x86 cache details.)

MOESI allows sending dirty cache lines directly between caches, instead of writing back to a shared outer cache and then re-reading from there. The linked wiki article has a bit more detail, but it's basically about sharing dirty data. The Owned state keeps track of which cache is responsible for writing back the dirty data.
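A toy sketch of that cache-to-cache path (my own illustration, not a real coherence implementation): a remote read hits a Modified line, the dirty data is forwarded directly, the old owner drops to Owned (keeping write-back responsibility), and memory is never touched.

```python
# Toy MOESI read-miss sketch (illustrative only, not a real implementation).
# Each cache is {address: (state, data)}; memory is a plain dict.

MEMORY = {0x40: "stale"}          # memory still holds the old value

def read_miss(requester, caches, memory, addr):
    """Requester read-misses on addr; a dirty peer cache may supply it."""
    for cache in caches:
        if cache is requester or addr not in cache:
            continue
        state, data = cache[addr]
        if state in ("M", "O"):
            # Dirty line: forward cache-to-cache; owner keeps the
            # write-back duty by moving to the Owned state.
            cache[addr] = ("O", data)
            requester[addr] = ("S", data)
            return data, "cache-to-cache"
    # No dirty copy anywhere: fall back to memory.
    data = memory[addr]
    requester[addr] = ("S", data)  # (would be E with no other sharers; simplified)
    return data, "memory"

c0 = {0x40: ("M", "fresh")}       # core 0 modified the line
c1 = {}                            # core 1 now reads it
data, source = read_miss(c1, [c0, c1], MEMORY, 0x40)
```

Note that `MEMORY` still says `"stale"` afterward: the whole point of Owned is that the write-back is deferred, not performed on every share.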

MESIF allows caches to Forward a copy of a clean cache line to another cache, instead of other caches having to re-read it from memory to get another Shared copy. (Intel since Nehalem has used a single large shared L3 cache for all cores, so all requests are ultimately backstopped by one L3 cache before checking memory anyway, but that's only for the cores on one socket. Forwarding applies between sockets in a multi-socket system. Until Skylake-AVX512, the large shared L3 cache was inclusive. See also: Which cache mapping technique is used in intel core i7 processor?)
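Another toy sketch (again my own simplification): among several clean sharers, exactly one holds the Forward state and answers the next read, and F migrates to the newest requester, so memory/L3 is not re-read.

```python
# Toy MESIF read-miss sketch (illustrative only): one Forwarder among
# clean sharers services read requests; F moves to the newest sharer.

def read_miss(requester, caches, memory, addr):
    for cache in caches:
        if cache is requester or addr not in cache:
            continue
        state, data = cache[addr]
        if state == "F":
            cache[addr] = ("S", data)       # old forwarder demotes to Shared
            requester[addr] = ("F", data)   # newest sharer becomes forwarder
            return data, "forwarded"
    # Only plain Shared copies (or none): must go to memory / outer cache.
    data = memory[addr]
    requester[addr] = ("F", data)
    return data, "memory"

memory = {0x80: "clean"}
c0 = {0x80: ("F", "clean")}   # current forwarder
c1 = {0x80: ("S", "clean")}   # plain sharer: stays quiet
c2 = {}                        # new reader
data, source = read_miss(c2, [c0, c1, c2], memory, 0x80)
```

Handing F to the most recent requester is a deliberate design choice in the real protocol too: the newest sharer is the least likely to evict the line soon.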

Wikipedia's MESIF article (linked above) has some comparison between MOESI and MESIF.


AMD in some cases has lower latency for sharing the same cache line between 2 cores. For example, see this graph of inter-core latency for Ryzen vs. quad-core Intel vs. many-core Intel (ring bus: Broadwell) vs. Skylake-X (worst).

Obviously there are many other differences between Intel and AMD designs that affect inter-core latency, like Intel using a ring bus or mesh, and AMD using a crossbar / all-to-all design with small clusters. (e.g. Ryzen has clusters of 4 cores that share an L3. That's why the inter-core latency for Ryzen has another step from core #3 to core #4.)

BTW, notice that latency between two logical cores on the same physical core is much lower, on both Intel and AMD. See: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?

I didn't look for any academic papers that simulated MESI vs. MOESI on an otherwise-similar model.

Choice of MESIF vs. MOESI can be influenced by other design factors. Intel's use of a large tag-inclusive shared L3 cache as a backstop for coherency traffic is their solution to the same problem that MOESI solves: when a core has the line in Modified state in a private L2 or L1d, traffic between cores is handled by writing back to L3 and then sending the data from L3 to the requesting core.
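To make the contrast with the MOESI sketch above concrete, here's a toy model (my own simplification) of that L3-backstop flow: the dirty private copy is written back to the shared L3 first, then L3 supplies the requester, so the data moves twice instead of once, but memory still isn't involved.

```python
# Toy sketch of an inclusive-L3 backstop for a Modified line
# (illustrative only): write back to L3, then L3 services the reader.

def l3_backstop_read(requester_l1, owner_l1, l3, addr):
    transfers = 0
    state, data = owner_l1[addr]
    if state == "M":
        l3[addr] = ("S", data)        # 1st transfer: write dirty data back to L3
        owner_l1[addr] = ("S", data)  # owner's copy is now clean/Shared
        transfers += 1
    else:
        data = l3[addr][1]            # clean hit: L3 already has current data
    requester_l1[addr] = ("S", data)  # 2nd transfer: L3 supplies the requester
    transfers += 1
    return data, transfers

l3 = {}
owner = {0x100: ("M", "v2")}
reader = {}
data, hops = l3_backstop_read(reader, owner, l3, 0x100)
```

The extra hop through L3 has a side benefit: the shared cache now holds a current copy, so later readers hit in L3 without bothering any core.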

IIRC, some AMD designs (like some versions of Bulldozer-family) didn't have a last-level cache shared by all cores, and instead had larger L2 caches shared by pairs of cores. Higher-performance BD-family CPUs did also have a shared cache, though, so at least clean data could hit in L3.

Peter Cordes answered Jan 04 '23


MOESI is almost always superior to MESI in terms of absolute performance. However, MESI only requires 2 bits per cache line to hold the state, while MOESI requires 3 bits per cache line. Therefore, for smaller cache lines, the relative area overhead of MOESI increases. This may not be justified when applications in the target domain perform very few writes to shared cache lines. Even the additional power or static-energy overhead may not be tolerable in certain domains. For these reasons, MOESI might be too expensive for low-energy/low-performance/small processors; that is, MOESI would be less efficient in terms of performance-per-watt or performance-per-joule.

ARM11 uses MESI. ARM Cortex-A57 uses MESI at L1 and MOESI at L2.

Note that the decision to use a particular coherence protocol is not made independently of decisions about other aspects of the cache hierarchy, the interconnect, and the number of cores; these parameters influence each other.
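The 2-bit vs. 3-bit point is easy to quantify. A quick back-of-the-envelope calculation (hypothetical cache sizes, not any specific chip): MESI's 4 states fit in 2 bits, MOESI's 5 states need 3, and the total state storage scales with the number of lines.

```python
import math

# State-storage overhead per cache (illustrative numbers only).
# MESI: 4 states -> 2 bits/line; MOESI: 5 states -> 3 bits/line.

def state_bits(num_states):
    """Bits needed to encode num_states distinct states."""
    return math.ceil(math.log2(num_states))

def total_state_bits(cache_bytes, line_bytes, num_states):
    lines = cache_bytes // line_bytes
    return lines * state_bits(num_states)

CACHE, LINE = 32 * 1024, 64                 # hypothetical 32 KiB L1, 64 B lines
mesi  = total_state_bits(CACHE, LINE, 4)    # 512 lines * 2 bits
moesi = total_state_bits(CACHE, LINE, 5)    # 512 lines * 3 bits
```

The absolute numbers are tiny next to the data and tag arrays; the answer's point is the *relative* cost, which grows with smaller lines (more lines per cache) and matters most in small, low-power designs.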

Hadi Brais answered Jan 04 '23