Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?

I wrote some lock-free code that works fine with local reads, under most conditions.

Does local spinning on a memory read necessarily imply I have to ALWAYS insert a memory barrier before the spinning read?

(To validate this, I managed to produce a reader/writer combination which results in a reader never seeing the written value, under certain very specific conditions--dedicated CPU, process attached to CPU, optimizer turned all the way up, no other work done in the loop--so the arrows do point in that direction, but I'm not entirely sure about the cost of spinning through a memory barrier.)
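To make this concrete, here is a minimal sketch of the kind of reader/writer pair I mean (not my actual test program; the names are made up). With no barrier and no volatile, compiling at -O2 lets the compiler hoist the load out of the loop, so the reader can spin forever on a stale value:

#include <pthread.h>
#include <stdio.h>

int flag = 0;                      /* deliberately not volatile, no barriers */

void *reader( void *arg ) {
    while ( flag == 0 )            /* at -O2 the load may be hoisted out */
        ;
    printf( "reader saw the write\n" );
    return NULL;
}

void *writer( void *arg ) {
    flag = 1;                      /* plain store; may linger in the store buffer */
    return NULL;
}

int main( void ) {
    pthread_t r, w;
    pthread_create( &r, NULL, reader, NULL );
    pthread_create( &w, NULL, writer, NULL );
    pthread_join( w, NULL );
    pthread_join( r, NULL );       /* may never return when optimized */
    return 0;
}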

What is the cost of spinning through a memory barrier if there is nothing to be flushed from the store buffer? I.e., all the process is doing (in C) is:

volatile int value;            /* shared word written by another CPU */
int v;

while ( 1 ) {
    __sync_synchronize();      /* full barrier before each read */
    v = value;
    if ( v != 0 ) {
        ... something ...
    }
}

Am I correct to assume that it's free and it won't encumber the memory bus with any traffic?

Another way to put this is to ask: does a memory barrier do anything more than flush the store buffer, apply the pending invalidations to the cache, and prevent the compiler from reordering reads/writes across its location?
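For reference, the contrast I have in mind is between a pure compiler barrier, which only prevents the compiler from reordering and emits no instruction, and the full barrier, which adds a hardware fence on top of that:

/* compiler barrier only: stops GCC from moving loads/stores
   across this point, but emits no machine instruction */
__asm__ __volatile__( "" ::: "memory" );

/* full barrier: a compiler barrier plus a hardware fence
   (GCC emits mfence or an equivalent LOCK'ed instruction) */
__sync_synchronize();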


Disassembling, __sync_synchronize() appears to translate into:

lock orl
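(My guess, and it is only a guess, is that this is the well-known trick of using a dummy LOCK'ed read-modify-write on the stack as a full fence where mfence cannot be assumed. As inline assembly on 32-bit x86 it would look roughly like this; the exact operands are my assumption:)

/* assumed equivalent: a LOCK'ed OR of zero into the top of the
   stack changes no data but acts as a full memory fence */
__asm__ __volatile__( "lock; orl $0,(%%esp)" ::: "memory", "cc" );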

From the Intel manual (similarly nebulous for the neophyte):

Volume 3A: System Programming Guide, Part 1, Section 8.1.2, "Bus Locking":

Intel 64 and IA-32 processors provide a LOCK# signal that
is asserted automatically during certain critical memory
operations to lock the system bus or equivalent link.
While this output signal is asserted, requests from other
processors or bus agents for control of the bus are
blocked.

[...]

For the P6 and more recent processor families, if the
memory area being accessed is cached internally in the
processor, the LOCK# signal is generally not asserted;
instead, locking is only applied to the processor’s caches
(see Section 8.1.4, “Effects of a LOCK Operation on
Internal Processor Caches”).

My translation: "when you say LOCK, this would be expensive, but we're only doing it where necessary."


@BlankXavier:

I did test this: if the writer does not explicitly push the write out of the store buffer, and it is the only process running on that CPU, the reader may never see the effect of the write. I can reproduce it with a test program, but as I mentioned above, it happens only with a specific test, specific compilation options, and dedicated core assignments. My algorithm works fine; it was only when I got curious about how this works and wrote the explicit test that I realized it could potentially have a problem down the road.

I think by default simple writes are WB writes (write-back), which means they don't get flushed out immediately, but reads will take their most recent value (I think that's called "store forwarding"). So I use a CAS instruction for the writer. I discovered in the Intel manual (vol. 3A, chap. 11-10) all the different types of write implementations (UC, WC, WT, WB, WP); I'm still learning about them.
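Concretely, the writer does something like this (a sketch, not my real code; value is the shared word from the loop above and new_value is a placeholder):

/* writer side: publish through a LOCK'ed CAS instead of a plain
   WB store, so the write carries full-fence semantics */
int old;
do {
    old = value;
} while ( !__sync_bool_compare_and_swap( &value, old, new_value ) );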

My uncertainty is on the reader's side. I understand from McKenney's paper that there is also an invalidation queue: a queue of invalidations coming in from the bus to the cache. I'm not sure how this part works. In particular, you seem to imply that looping through a normal read (i.e., non-LOCK'ed, without a barrier, using volatile only to ensure the optimizer keeps the read in the compiled code) will check the "invalidation queue" every time (if such a thing exists). If a simple read is not good enough (i.e., if it could read an old cache line that still appears valid because an invalidation for it is still queued; that sounds a bit incoherent to me too, but then how do invalidation queues work?), then an atomic read would be necessary, and my question is: in that case, will this have any impact on the bus? (I think probably not.)
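(By an "atomic read" I mean something like this sketch: a LOCK'ed read-modify-write that happens not to modify anything, instead of a plain load.)

/* reader side: fetch-and-add of zero reads value through a
   LOCK'ed instruction, with full-barrier semantics */
v = __sync_fetch_and_add( &value, 0 );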

I'm still reading my way through the Intel manual, and while I see a great discussion of store forwarding, I haven't found a good discussion of invalidation queues. I've decided to convert my C code into ASM and experiment; I think this is the best way to really get a feel for how this works.

asked Jul 25 '11 by blais

1 Answer

The "xchg reg,[mem]" instruction will signal its lock intention over the LOCK pin of the core. This signal weaves its way past other cores and caches down to the bus-mastering buses (PCI variants etc) which will finish what they are doing and eventually the LOCKA (acknowledge) pin will signal the CPU that the xchg may complete. Then the LOCK signal is shut off. This sequence can take a long time (hundreds of CPU cycles or more) to complete. Afterwards the appropriate cache lines of the other cores will have been invalidated and you will have a known state, i e one that has ben synchronized between the cores.

The xchg instruction is all that is necessary to implement an atomic lock. If the lock itself is successful, you have access to the resource that you have defined the lock to control access to. Such a resource could be a memory area, a file, a device, a function, or what have you. Still, it is always up to the programmer to write code that uses the resource when it's been locked and doesn't when it hasn't. Typically the code sequence following a successful lock should be made as short as possible, so that other code is hindered as little as possible from acquiring access to the resource.

Keep in mind that if the lock wasn't successful, you need to try again by issuing a new xchg.
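As a sketch, using the GCC builtins that typically compile to xchg on x86 (the names here are illustrative):

static volatile int lock_word = 0;       /* 0 = free, 1 = held */

void acquire( void ) {
    /* __sync_lock_test_and_set typically compiles to xchg; if the
       old value was 1 the lock was held: retry with a new xchg */
    while ( __sync_lock_test_and_set( &lock_word, 1 ) != 0 )
        ;
}

void release( void ) {
    __sync_lock_release( &lock_word );   /* stores 0 with release semantics */
}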

"Lock free" is an appealing concept but it requires the elimination of shared resources. If your application has two or more cores simultaneously reading from and writing to a common memory address "lock free" is not an option.

answered Sep 20 '22 by Olof Forshell