
Implementing a ticket lock with atomics generates extra mov

I wrote a naive implementation of a simple ticket lock. The locking part looks like:

struct ticket {
    uint16_t next_ticket;
    uint16_t now_serving;
};

void lock(ticket* tkt) {
    const uint16_t my_ticket =
        __sync_fetch_and_add(&tkt->next_ticket, 1); 
    while (tkt->now_serving != my_ticket) {
        _mm_pause();
        __asm__ __volatile__("":::"memory");
    }   
}

Then I realized that rather than using a gcc intrinsic, I can write this with std::atomics:

struct atom_ticket {
    std::atomic<uint16_t> next_ticket;
    std::atomic<uint16_t> now_serving;
};

void lock(atom_ticket* tkt) {
    const uint16_t my_ticket =
        tkt->next_ticket.fetch_add(1, std::memory_order_relaxed);
    while (tkt->now_serving.load(std::memory_order_relaxed) != my_ticket) {
        _mm_pause();
    }   
}

These generate almost identical assembly, but the latter has an additional movzwl instruction. Why is there this extra mov? Is there a better, correct way to write lock()?

Assembly output with -march=native -O3:

 0000000000000000 <lock(ticket*)>:
    0:   b8 01 00 00 00          mov    $0x1,%eax
    5:   66 f0 0f c1 07          lock xadd %ax,(%rdi)
    a:   66 39 47 02             cmp    %ax,0x2(%rdi)
    e:   74 08                   je     18 <lock(ticket*)+0x18>
   10:   f3 90                   pause  
   12:   66 39 47 02             cmp    %ax,0x2(%rdi)
   16:   75 f8                   jne    10 <lock(ticket*)+0x10>
   18:   f3 c3                   repz retq 
   1a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

 0000000000000020 <lock(atom_ticket*)>:
   20:   ba 01 00 00 00          mov    $0x1,%edx
   25:   66 f0 0f c1 17          lock xadd %dx,(%rdi)
   2a:   48 83 c7 02             add    $0x2,%rdi
   2e:   eb 02                   jmp    32 <lock(atom_ticket*)+0x12>
   30:   f3 90                   pause  
=> 32:   0f b7 07                movzwl (%rdi),%eax <== ???
   35:   66 39 c2                cmp    %ax,%dx
   38:   75 f6                   jne    30 <lock(atom_ticket*)+0x10>
   3a:   f3 c3                   repz retq 

Why not just cmp (%rdi),%dx directly?

asked Oct 22 '15 by Barry


1 Answer

First of all, I think you need to use std::memory_order_acquire, since you're acquiring a lock. If you use mo_relaxed, you could potentially see stale data from before some of the stores that the previous lock-holder did. Jeff Preshing's blog is excellent, and he has a post on release/acquire semantics.

On x86, that reordering can only happen at compile time, and mo_relaxed tells the compiler it's allowed to do it. An acquire load compiles to the same asm as a relaxed load on x86 (every x86 asm load is already an acquire load); the difference is purely that the compiler can't reorder other memory operations across it. On weakly-ordered architectures that need it, you'll get whatever instructions are needed for a load-acquire.
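
A minimal sketch of what that change looks like, reusing the atom_ticket struct from the question (the unlock with a release store is my addition to show the matching side of the pattern, not something from the question):

#include <atomic>
#include <cstdint>
#include <immintrin.h>   // _mm_pause

struct atom_ticket {
    std::atomic<uint16_t> next_ticket;
    std::atomic<uint16_t> now_serving;
};

void lock(atom_ticket* tkt) {
    // Taking a ticket can stay relaxed: the xadd only needs to be atomic,
    // it doesn't need to be an ordering point.
    const uint16_t my_ticket =
        tkt->next_ticket.fetch_add(1, std::memory_order_relaxed);
    // Acquire load: same x86 asm as a relaxed load, but the compiler can't
    // hoist the critical section's memory operations above it.
    while (tkt->now_serving.load(std::memory_order_acquire) != my_ticket) {
        _mm_pause();
    }
}

// My assumption of the usual matching unlock (not part of the question):
// a release store publishes the critical section's writes before handing
// the lock to the next waiter.
void unlock(atom_ticket* tkt) {
    const uint16_t next =
        tkt->now_serving.load(std::memory_order_relaxed) + 1;
    tkt->now_serving.store(next, std::memory_order_release);
}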


I put a version of the code on godbolt to look at the asm with various compilers.


Well spotted: this does look like a gcc optimization failure, still present at least in 6.0 (checked with Wandbox, using a main that does return execlp("objdump", "objdump", "-Mintel", "-d", argv[0], NULL); to dump disassembler output of itself, including the functions we're interested in).
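
For reference, a minimal sketch of that self-disassembling harness (assuming a POSIX system with objdump on the PATH, and that argv[0] is a path objdump can open):

#include <unistd.h>

int main(int argc, char* argv[]) {
    (void)argc;  // unused
    // Replace this process with objdump disassembling our own binary,
    // so the compiler's output for lock() can be inspected directly.
    return execlp("objdump", "objdump", "-Mintel", "-d", argv[0], (char*)NULL);
}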

It looks like clang 3.7 does even worse with this: it does a 16-bit load, then zero-extends, then compares.

gcc treats atomic loads specially, and apparently fails to notice that it can fold the load into the compare. Probably the optimization pass that could have done that ran while the atomic load was still represented differently from regular loads, or something. I'm not a gcc hacker, so this is mostly guesswork.

I suspect you either have an old gcc (4.9.2 or older), or you're building on / for AMD, because your compiler used rep ret even with -march=native. You should do something about that if you care about generating optimal code. I've noticed gcc 5 sometimes making better code than gcc 4.9 (not that it helps in this case, though :/).


I tried using uint32_t, with no luck.

The performance impact of doing the load and compare separately is probably irrelevant since this function is a busy-wait loop.

The fast path (the unlocked case, where the loop condition is false on the first iteration) is still only one taken branch and a ret. However, in the std::atomic version, the fast path goes through the loop branch. So instead of two separate branch-predictor entries (one for the fast path and one for the spin loop), spinning will now probably cause a branch mispredict in the next unlocked case. This is probably not an issue, and the new code does use one fewer branch-predictor entry.

IDK if jumping into the middle of a loop has any ill effects on the uop cache of Intel SnB-family microarchitectures. It is something of a trace cache. Agner Fog's testing found that the same piece of code can have multiple entries in the uop cache if it has multiple jump entry points. This function is already somewhat uop-cache unfriendly, since it starts with a mov r, imm / lock xadd. The lock xadd has to go in a uop cache line by itself, because it's microcoded (more than 4 uops; 9 in fact). An unconditional jump always ends a uop cache line. I'm not sure about a taken conditional branch, but I'd guess a taken jcc ends a cache line if it was predicted taken when it was decoded (e.g. branch-predictor entry still good, but old uop-cache entry evicted).

So the first version is potentially 3 uop-cache lines for the fast path: one for the mov (and, if inlined, hopefully mostly full with previous instructions), one for lock xadd alone, and one for the macro-fused cmp/je to following code (if inlined; if not, the jump targets the ret, which may end up as a 4th cache line for this 32-byte code block, which isn't allowed. So the non-inlined version of this may have to get re-decoded every time?)

The std::atomic version is again one uop-cache line for the initial mov imm (and preceding instructions), then lock xadd, then add / jmp, then ... uh oh, a 4th cache line is needed for the movzx / compare-and-branch uops. So this version is more likely to have a decode bottleneck even when inlined.

Fortunately, the frontend can still gain some ground and get instructions queued up for the OOO core while running this code, because lock xadd is 9 uops. That's enough to cover a cycle or two of fewer uops from the frontend, and a switch between decoding and uop-cache fetching.

The main problem here is just code size, since you probably want this inlined. Speed-wise, the fast path is only slightly worse, and the non-fast path is a spin loop anyway, so it doesn't matter.

The fast-path is 11 fused-domain uops for the old version (1 mov imm, 9 lock xadd, 1 cmp/je macro fused). The cmp/je includes a micro-fused memory operand.

The fast-path is 14 fused-domain uops for the new version (1 mov imm, 9 lock xadd, 1 add, 1 jmp, 1 movzx, 1 cmp/je macro fused).

Using an add instead of just using an 8-bit offset in the addressing mode of movzx is really shooting itself in the foot, IMO. IDK if gcc thinks ahead far enough to make choices like that so the loop branch target comes out at a 16B boundary, or if that was just dumb luck.


Compiler-identification experiment on godbolt using the OP's code:

  • gcc 4.8 and earlier: always uses rep ret when the ret is a branch target, even with -march=native -mtune=core2 (on a Haswell), or with just -march=core2.
  • gcc 4.9: Uses rep ret with -march=native on Haswell, probably because Haswell is too new for it. -march=native -mtune=haswell uses just ret, so it does know the name haswell.
  • gcc 5.1 and later: uses ret with -march=native (on Haswell). Still uses rep ret when -march isn't specified.

answered Nov 03 '22 by Peter Cordes