Which is a better write barrier on x86: lock+addl or xchgl?

Question

The Linux kernel uses lock; addl $0,0(%%esp) as write barrier, while the RE2 library uses xchgl (%0),%0 as write barrier. What's the difference and which is better?

Does x86 also require read barrier instructions? RE2 defines its read barrier function as a no-op on x86 while Linux defines it as either lfence or no-op depending on whether SSE2 is available. When is lfence required?

Fabian Giesen · Accepted Answer

Quoting from the IA32 manuals (Vol 3A, Chapter 8.2: Memory Ordering):

In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles [..]

Reads are not reordered with other reads

Writes are not reordered with older reads

Writes to memory are not reordered with other writes, with the exception of

writes executed with the CLFLUSH instruction

streaming stores (writes) executed with the non-temporal move instructions ([list of instructions here])

string operations (see Section 8.2.4.1)

Reads may be reordered with older writes to different locations but not with older writes to the same location.

Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions

Reads cannot pass LFENCE and MFENCE instructions

Writes cannot pass SFENCE and MFENCE instructions

Note: The "In a single-processor system" above is slightly misleading. The same rules hold for each (logical) processor individually; the manual then goes on to describe the additional ordering rules between multiple processors. The only bit about it pertaining to the question is that

Locked instructions have a total order.

In short, as long as you're writing to write-back memory (which is all memory you'll ever see as long as you're not a driver or graphics programmer), most x86 instructions are almost sequentially consistent - the only reordering an x86 CPU can perform is reorder later (independent) reads to execute before writes. The main thing about the write barriers is that they have a lock prefix (implicit or explicit), which forbids all reordering and ensures that the operations is seen in the same order by all processors in a multi-processor system.

Also, in write-back memory, reads are never reordered, so there's no need for read barriers. Recent x86 processors have a weaker memory consistency model for streaming stores and write-combined memory (commonly used for mapped graphics memory). That's where the various fence instructions come into play; they're not necessary for any other memory type, but some drivers in the Linux kernel do deal with write-combined memory so they just defined their read-barrier that way. The list of ordering model per memory type is in Section 11.3.1 in Vol. 3A of the IA-32 manuals. Short version: Write-Through, Write-Back and Write-Protected allow speculative reads (following the rules as detailed above), Uncachable and Strong Uncacheable memory has strong ordering guarantees (no processor reordering, reads/writes are immediately executed, used for MMIO) and Write Combined memory has weak ordering (i.e. relaxed ordering rules that need fences).

GJ. · Answer

The "lock; addl $0,0(%%esp)" is faster in case that we testing the 0 state of lock variable at (%%esp) address. Because we add 0 value to lock variable and the zero flag is set to 1 if the lock value of variable at address (%%esp) is 0.

lfence from Intel datasheet:

Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. This serializing operation guarantees that every load instruction that precedes in program order the LFENCE instruction is globally visible before any load instruction that follows the LFENCE instruction is globally visible.

(Editor's note: mfence or a locked operation is the only useful fence (after a store) for sequential consistency. lfence does not block StoreLoad reordering by the store buffer.)

For instance: memory write instruction like 'mov' are atomic (they don't need lock prefix) if they are properly aligned. But this instruction is normally executed in CPU cache and will not be globally visible at this moment for all other threads, because memory fence must be performed first to make this thread wait until previous stores are visible to other threads.

So the main difference between these two instructions is that xchgl instruction will not have any effect on the conditional flags. Certainly we can test the lock variable state with lock cmpxchg instruction but this is still more complex than with lock add $0 instruction.

Which is a better write barrier on x86: lock+addl or xchgl?

Tags:

x86

assembly

memory-barriers

Hongli

2 Answers

Fabian Giesen

GJ.

Recent Activity

Donate For Us

Which is a better write barrier on x86: lock+addl or xchgl?

Tags:

x86

assembly

memory-barriers

Hongli

2 Answers

Fabian Giesen

GJ.

Related questions

Recent Activity

Donate For Us