
x86-64 usage of LFENCE

I'm trying to understand the right way to use fences when measuring time with RDTSC/RDTSCP. Several questions on SO related to this have already been answered elaborately. I have gone through a few of them. I have also gone through this really helpful article on the same topic: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

However, in another online blog, there's an example of using LFENCE instead of CPUID on x86. I was wondering how LFENCE prevents earlier stores from contaminating the RDTSC measurements. E.g.

<Instr A>
LFENCE/CPUID
RDTSC
<Code to be benchmarked>
LFENCE/CPUID
RDTSC 

In the above case, LFENCE ensures that all earlier loads complete before it (since the SDM says: LFENCE instructions cannot pass earlier reads). But what about earlier stores (say, Instr A was a store)? I understand why CPUID works, because it IS a serializing instruction, but LFENCE is not.

One explanation I found was in Intel SDM VOL 3A Section 8.3, the following footnote:

LFENCE does provide some guarantees on instruction ordering. It does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

So essentially LFENCE acts like an MFENCE. In that case, why do we need two separate instructions LFENCE and MFENCE?

I'm probably missing something.

Thanks in advance.

asked May 26 '16 06:05 by Chandan


2 Answers

The key point is the adverb locally in the quoted sentence "It does not execute until all prior instructions have completed locally".

I was unable to find a clear definition of "complete locally" anywhere in the Intel manuals; my speculation is explained below.


In order to be completed locally, an instruction must have its output computed and available to the other instructions further down its dependency chain. Furthermore, any side effect of that instruction must be visible inside the core.

In order to be completed globally an instruction must have its side effects visible to other system components (like other CPUs).

If we don't qualify the kind of "completeness" we are talking about, it generally means that we don't care or that it is implicit in the context.


For a lot of instructions, being completed locally and being completed globally are the same thing.
For a load, for example, to be completed locally some data must be fetched from memory or the caches. This is the same as being completed globally, since we cannot mark the load complete if we don't read from the memory hierarchy first.

For a store however the situation is different.

Intel processors have a Store Buffer to handle writes to memory. From Section 11.10 of Volume 3 of the manual:

Intel 64 and IA-32 processors temporarily store each write (store) to memory in a store buffer. The store buffer improves processor performance by allowing the processor to continue executing instructions without having to wait until a write to memory and/or to a cache is complete. It also allows writes to be delayed for more efficient use of memory-access bus cycles.

So a store can be completed locally by being put in the store buffer; from the core's perspective, it is as if the write had gone all the way to memory.
Under specific circumstances, a later load on the same core can even read that value back from the store buffer (this is called Store Forwarding).

To be completed globally, however, a store needs to be drained from the Store Buffer.

Finally, it must be added that the Store Buffer is drained by serializing instructions (and by SFENCE and MFENCE):

The contents of the store buffer are always drained to memory in the following situations:
• (P6 and more recent processor families only) When a serializing instruction is executed.
• (Pentium III, and more recent processor families only) When using an SFENCE instruction to order stores.
• (Pentium 4 and more recent processor families only) When using an MFENCE instruction to order stores.
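
To make the local/global distinction concrete, here is a minimal C sketch of a store followed by a load, as in one side of a Dekker-style flag handshake. The names my_flag, other_flag and announce_and_check are hypothetical and only for illustration; the only real API used is the _mm_mfence() intrinsic from <emmintrin.h>. Without the fence, the store to my_flag could still be sitting in the store buffer (completed locally only) when the load of other_flag executes.

```c
#include <emmintrin.h>  /* _mm_mfence (SSE2) */

/* Hypothetical shared flags for a two-thread handshake. */
volatile int my_flag = 0;
volatile int other_flag = 0;

/* Announce our intent, then check what the other side has announced. */
int announce_and_check(void)
{
    my_flag = 1;        /* store: may be held in the store buffer (locally complete) */
    _mm_mfence();       /* drain it: the store becomes globally visible before later loads */
    return other_flag;  /* load: executes only after the store above is visible */
}
```

The second thread would run the mirror image of this function with the two flags swapped. An lfence in the same position would not be a substitute if, as argued above, it does not drain the store buffer.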


Being done with the introduction, let's see what lfence, mfence and sfence do:

LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

MFENCE performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. MFENCE does not serialize the instruction stream.

SFENCE performs a serializing operation on all store-to-memory instructions that were issued prior the SFENCE instruction.

So lfence is a weaker form of serialization that doesn't drain the Store Buffer. Since it effectively serializes instructions locally, all loads before it must be completed before it completes.

sfence serializes stores only: it basically doesn't allow the processor to execute any more stores until sfence is retired. It also drains the Store Buffer.

mfence is not a simple combination of the two, because it is not serializing in the classical sense: it is an sfence that also prevents later loads from being executed.


It may be worth noting that sfence was introduced first and the other two came later to achieve more granular control over memory ordering.

Finally, I used to enclose an rdtsc instruction between two lfence instructions, to be sure that no reordering "backward" or "forward" was possible.
However, I'm not sure about the soundness of this technique.
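
A minimal C sketch of the lfence / rdtsc / lfence pattern just described, assuming GCC or Clang on x86-64 and the Intel behaviour quoted in the footnote above. The helper names fenced_rdtsc, time_ticks and example_workload are hypothetical; __rdtsc() and _mm_lfence() are the compiler intrinsics from <x86intrin.h>. This is an illustration of the idea, not a validated benchmarking harness.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence (GCC/Clang; MSVC uses <intrin.h>) */

/* Read the time-stamp counter with an LFENCE on each side. */
uint64_t fenced_rdtsc(void)
{
    _mm_lfence();              /* earlier instructions complete locally first */
    uint64_t tsc = __rdtsc();
    _mm_lfence();              /* later instructions do not begin until this completes */
    return tsc;
}

/* Hypothetical helper: TSC ticks spent in fn(arg), fence overhead included. */
uint64_t time_ticks(void (*fn)(void *), void *arg)
{
    uint64_t start = fenced_rdtsc();
    fn(arg);
    uint64_t end = fenced_rdtsc();
    return end - start;
}

/* Hypothetical workload, used only to have something to measure. */
void example_workload(void *arg)
{
    volatile unsigned *counter = arg;
    for (unsigned i = 0; i < 1000; i++)
        (*counter)++;
}
```

Calling time_ticks(example_workload, &counter) with an unsigned counter returns the elapsed ticks. Note that earlier stores may still be in the store buffer when rdtsc executes, since lfence only waits for local completion.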

answered Nov 16 '22 19:11 by Margaret Bloom


As you rightly observed, it is a matter of serialization. Your question

why do we need two separate instructions LFENCE and MFENCE?

is answered in the Intel SDM in section "5.6.4 - SSE2 Cacheability Control and Ordering Instructions":

LFENCE Serializes load operations
MFENCE Serializes load and store operations

So LFENCE is probably used because MFENCE isn't necessary for RDTSC.

answered Nov 16 '22 20:11 by zx485