Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

CMPXCHG16B correct?

This doesn't exactly seem to be right although I am unsure why. Advice would be great as the documentation for CMPXCHG16B is pretty minimal (I don't own any intel manuals...)

template<>
inline bool cas(volatile types::uint128_t *src, types::uint128_t cmp, types::uint128_t with)
{
    /*
    Description:
     The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX and RCX:RBX registers 
     with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set, 
     and the RCX:RBX value is copied to the memory location. 
     Otherwise, the ZF flag is cleared, and the memory value is copied to RDX:RAX.
     */
    uint64_t * cmpP = (uint64_t*)&cmp;
    uint64_t * withP = (uint64_t*)&with;
    unsigned char result = 0;
    __asm__ __volatile__ (
    "LOCK; CMPXCHG16B %1\n\t"
    "SETZ %b0\n\t"
    : "=q"(result)  /* output */ 
    : "m"(*src), /* input */
      //what to compare against
      "rax"( ((uint64_t) (cmpP[1])) ), //lower bits
      "rdx"( ((uint64_t) (cmpP[0])) ),//upper bits
      //what to replace it with if it was equal
      "rbx"( ((uint64_t) (withP[1])) ), //lower bits
      "rcx"( ((uint64_t) (withP[0]) ) )//upper bits
    : "memory", "cc", "rax", "rdx", "rbx","rcx" /* clobbered items */
    );
    return result;
}

When running with an example I am getting 0 when it should be 1. Any ideas?

like image 849
JH. Avatar asked Jan 28 '11 06:01

JH.


2 Answers

It's good to note that if you're using GCC, you don't need to use inline asm to get at this instruction. You can use one of the __sync functions, like:

template<>
inline bool cas(volatile types::uint128_t *src,
                types::uint128_t cmp,
                types::uint128_t with)
{
    return __sync_bool_compare_and_swap(src, cmp, with);
}

Microsoft has a similar function for VC++:

__int64 exchhi = __int64(with >> 64);
__int64 exchlo = (__int64)(with);

return _InterlockedCompareExchange128(a, exchhi, exchlo, &cmp) != 0;
like image 199
Cory Nelson Avatar answered Sep 21 '22 17:09

Cory Nelson


Here are some alternatives compared:

  1. Inline assembly, such as @luke h's answer.

  2. __sync_bool_compare_and_swap(): a GNU extension, gcc/clang/ICC-only, deprecated, pseudo-function that the compiler will issue a CMPXCHG16B instruction for at least with -mcx16

  3. atomic_compare_exchange_weak() / strong: the C11 pseudo-function that does what atomic<> does in C++11. For GNU, this will NOT issue a CMPXCHG16B in gcc 7 and later, but instead calls libatomic (which must therefore be linked to). Dynamically linked libatomic will decide what version of functions to use based on what the CPU is capable of, and on machines where the CPU is capable of CMPXCHG16B it will use that.

  4. apparently clang will still inline CMPXCHG16B for atomic_compare_exchange_weak() or strong

I haven't tried the machine language, but looking at the disassembly of (2), it looks perfect and I don't see how (1) could beat it. (I have little knowledge of x86 but programmed a lot of 6502.) Further, there's ample advice never to use assembly if you can avoid it, and it can be avoided at least with gcc/clang. So I can cross (1) off the list.

Here's the code for (2) in gcc version 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC):

Thread 2 "mybinary" hit Breakpoint 1, MyFunc() at myfile.c:586
586               if ( __sync_bool_compare_and_swap( &myvar,
=> 0x0000000000407262 <MyFunc+904>:       49 89 c2        mov    %rax,%r10
   0x0000000000407265 <MyFunc+907>:       49 89 d3        mov    %rdx,%r11
(gdb) n
587                                                  was, new ) ) {
=> 0x0000000000407268 <MyFunc+910>:       48 8b 45 a0     mov    -0x60(%rbp),%rax
   0x000000000040726c <MyFunc+914>:       48 8b 55 a8     mov    -0x58(%rbp),%rdx
(gdb) n
586               if ( __sync_bool_compare_and_swap( &myvar,
=> 0x0000000000407270 <MyFunc+918>:       48 c7 c6 00 d3 42 00    mov    $0x42d300,%rsi
   0x0000000000407277 <MyFunc+925>:       4c 89 d3        mov    %r10,%rbx
   0x000000000040727a <MyFunc+928>:       4c 89 d9        mov    %r11,%rcx
   0x000000000040727d <MyFunc+931>:       f0 48 0f c7 8e 70 04 00 00      lock cmpxchg16b 0x470(%rsi)
   0x0000000000407286 <MyFunc+940>:       0f 94 c0        sete   %al

Then doing hammer tests of (2) and (3) on real-world algos, I am seeing no real performance difference. Even in theory, (3) only has the overhead of one additional function call and some work in the libatomic wrapper function, including a branch on whether the CAS succeeded or not.

(With lazy dynamic linking, the first call into the libatomic function will actually run an init function that uses CPUID to check if your CPU has cmpxchg16b. Then it will update the GOT function-pointer that the PLT stub jumps through, so future calls will go straight to libat_compare_exchange_16_i1 that uses lock cmpxchg16b. The i1 suffix in the name is from GCC's ifunc mechanism for function multi-versioning; if you ran it on a CPU without cmpxchg16b support, it would resolve the shared library function to a version that used locking.)

In my real-world-ish hammer tests, that function call overhead is lost in the amount of CPU taken by the functionality the lock-free mechanism is protecting. Therefore, I don't see a reason to use the __sync, functions, that are compiler-specific and deprecated to boot.

Here's the assembly for the libatomic wrapper that gets called for each .compare_exchange_weak(), from single-stepping through the assembly on my Fedora 31. If compiled with -fno-plt, a callq *__atomic_compare_exchange_16@GOTPCREL(%rip) will get inlined into the caller, avoiding the PLT and running the CPU-detection early, at program startup time instead of on the first call.

Thread 2 "tsquark" hit Breakpoint 2, 0x0000000000403210 in 
__atomic_compare_exchange_16@plt ()
=> 0x0000000000403210 <__atomic_compare_exchange_16@plt+0>:     ff 25 f2 8e 02 00       jmpq   *0x28ef2(%rip)        # 0x42c108 <[email protected]>
(gdb) disas
Dump of assembler code for function __atomic_compare_exchange_16@plt:
=> 0x0000000000403210 <+0>:     jmpq   *0x28ef2(%rip)        # 0x42c108 <[email protected]>
   0x0000000000403216 <+6>:     pushq  $0x1e
   0x000000000040321b <+11>:    jmpq   0x403020
End of assembler dump.
(gdb) s
Single stepping until exit from function __atomic_compare_exchange_16@plt,
...

0x00007ffff7fab250 in libat_compare_exchange_16_i1 () from /lib64/libatomic.so.1
=> 0x00007ffff7fab250 <libat_compare_exchange_16_i1+0>: f3 0f 1e fa     endbr64
(gdb) disas
Dump of assembler code for function libat_compare_exchange_16_i1:
=> 0x00007ffff7fab250 <+0>:     endbr64
   0x00007ffff7fab254 <+4>:     mov    (%rsi),%r8
   0x00007ffff7fab257 <+7>:     mov    0x8(%rsi),%r9
   0x00007ffff7fab25b <+11>:    push   %rbx
   0x00007ffff7fab25c <+12>:    mov    %rdx,%rbx
   0x00007ffff7fab25f <+15>:    mov    %r8,%rax
   0x00007ffff7fab262 <+18>:    mov    %r9,%rdx
   0x00007ffff7fab265 <+21>:    lock cmpxchg16b (%rdi)
   0x00007ffff7fab26a <+26>:    mov    %r9,%rcx
   0x00007ffff7fab26d <+29>:    xor    %rax,%r8
   0x00007ffff7fab270 <+32>:    mov    $0x1,%r9d
   0x00007ffff7fab276 <+38>:    xor    %rdx,%rcx
   0x00007ffff7fab279 <+41>:    or     %r8,%rcx
   0x00007ffff7fab27c <+44>:    je     0x7ffff7fab288 <libat_compare_exchange_16_i1+56>
   0x00007ffff7fab27e <+46>:    mov    %rax,(%rsi)
   0x00007ffff7fab281 <+49>:    xor    %r9d,%r9d
   0x00007ffff7fab284 <+52>:    mov    %rdx,0x8(%rsi)
   0x00007ffff7fab288 <+56>:    mov    %r9d,%eax
   0x00007ffff7fab28b <+59>:    pop    %rbx
   0x00007ffff7fab28c <+60>:    retq
End of assembler dump.

The only benefit I've found to using (2) is if your machine didn't ship with a libatomic (true of older Red Hat's) and you don't have the ability to request sysadmins provide this or don't want to count on them installing the right one. I personally downloaded one in source code and built it incorrectly, so that 16-byte swaps ended up using a mutex: disaster.

I haven't tried (4). Or rather I started to but so many warnings/errors on code that gcc passed without comment that I wasn't able to get it compiled in the budgeted time.

Note that while options 2, 3, and 4 seem like the same code or nearly the same code should work, in fact all three have substantially different checks and warnings and even if you have one of the three compiling fine and without warning in -Wall, you may get a lot more warnings or errors if you try one of the other options. The __sync* pseudo-functions aren't well-documented. (In fact the documentation only mentions 1/2/4/8 bytes, not that they work for 16 bytes. Meanwhile, they work "kind of" like function templates but you don't see the template and they seem finicky about whether the first and second arg types actually are the same type in a way that atomic_* aren't.) In short it wasn't the 3 minute job to compare 2, 3, and 4 that you might guess.

like image 28
Swiss Frank Avatar answered Sep 18 '22 17:09

Swiss Frank