Atomic 16 byte read on x64 CPUs

Question

I need to read and write 16 bytes atomically. I do the writing only with cmpxchg16b, which is available on all x64 processors except, I think, some early AMD64 ones.

Now the question: for aligned 16-byte values that are only ever modified using cmpxchg16b (which acts like a full memory barrier), is it ever possible to read a 16-byte location and see half old data and half new data?

As long as I read with an SSE instruction (so the thread cannot be interrupted in the middle of the read), I think it's impossible, even on multiprocessor NUMA systems, for the read to see inconsistent data. I think it must be atomic.

I am assuming that when cmpxchg16b executes, it modifies the 16 bytes atomically, not by writing two 8-byte blocks with the potential for other threads to read in between (honestly, I don't see how it could work if it weren't atomic).

Am I right? If I'm wrong, is there a way to do an atomic 16 byte read without resorting to locking?

Note: There are a couple of similar questions here, but they don't deal with the case where the writes are done only with cmpxchg16b, so I feel this is a separate, unanswered question.

Edit: Actually, I think my reasoning was faulty. An SSE load instruction may be executed as two 64-bit reads, and it may be possible for another processor to execute the cmpxchg16b in between the two reads.
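To make the concern in the edit concrete, here is a minimal single-threaded sketch of the interleaving being described. It does not use real concurrency or SSE; the names `pair64` and `torn_read_demo` are made up for illustration, and the "other processor's write" is simply placed between the two half reads by hand.

```c
#include <stdint.h>

/* Hypothetical 16-byte value made of two 64-bit halves. */
typedef struct
{
  uint64_t lo;
  uint64_t hi;
} pair64;

/* Simulates a 16-byte read that is split into two 8-byte reads,
   with a complete 16-byte write landing between them.  */
pair64
torn_read_demo (pair64 *shared, pair64 new_value)
{
  pair64 seen;
  seen.lo = shared->lo;   /* first 64-bit half: old data */
  *shared = new_value;    /* stands in for another CPU's 16-byte write */
  seen.hi = shared->hi;   /* second 64-bit half: new data */
  return seen;
}
```

Starting from {1, 1} and writing {2, 2} in between, the reader ends up with {1, 2}, a value that was never stored. Whether real hardware can actually split an aligned SSE load this way is exactly what the question asks.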

kay · Accepted Answer

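typedef struct
{
  unsigned __int128 value;
} __attribute__ ((aligned (16))) atomic_uint128;

unsigned __int128
atomic_read_uint128 (atomic_uint128 *src)
{
  unsigned __int128 result;
  /* Zero RDX:RAX (comparand) and RCX:RBX (replacement), then compare-exchange.
     "=&A": 128-bit result in RDX:RAX, early-clobbered so the address of *src
     is never placed in those registers.
     "+m": the instruction may write *src, so it is an in/out operand.  */
  asm volatile ("xor %%rax, %%rax;"
                "xor %%rbx, %%rbx;"
                "xor %%rcx, %%rcx;"
                "xor %%rdx, %%rdx;"
                "lock cmpxchg16b %1"
                : "=&A"(result), "+m"(*src)
                :
                : "rbx", "rcx", "cc", "memory");
  return result;
}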

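That should do the trick. The typedef ensures correct alignment: cmpxchg16b requires its memory operand to be aligned on a 16-byte boundary and raises a general-protection fault otherwise.

The cmpxchg16b compares *src with RDX:RAX, which is zero here. If they match, it writes RCX:RBX, also zero, so the stored value is unchanged. If they differ, it loads *src into RDX:RAX. In either case the correct value ends up in RDX:RAX afterwards. Note that the locked instruction always performs a write cycle, so src must point to writable memory.

The code above compiles to something as simple as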

push   %rbx
xor    %rax,%rax
xor    %rbx,%rbx
xor    %rcx,%rcx
xor    %rdx,%rdx
lock cmpxchg16b (%rdi)
pop    %rbx
retq
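The compare-and-load behaviour that this answer relies on can be modelled in plain C. The sketch below is a single-threaded model only: it is not atomic and is not how the hardware is implemented, and the names `u128_model`, `model_cmpxchg16b` and `model_read` are invented for illustration. It just shows why comparing against zero and storing zero always hands back the current value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 16-byte value type for this sketch: two 64-bit halves. */
typedef struct
{
  uint64_t lo;
  uint64_t hi;
} __attribute__ ((aligned (16))) u128_model;

/* Single-threaded model of what CMPXCHG16B computes.
   expected plays the role of RDX:RAX, desired the role of RCX:RBX.
   Returns true when the comparison succeeded (ZF = 1 on real hardware).  */
bool
model_cmpxchg16b (u128_model *dst, u128_model *expected, u128_model desired)
{
  /* The real instruction faults on a misaligned memory operand. */
  assert (((uintptr_t) dst & 15) == 0);

  if (dst->lo == expected->lo && dst->hi == expected->hi)
    {
      *dst = desired;     /* match: store RCX:RBX */
      return true;
    }
  *expected = *dst;       /* mismatch: load the current value into RDX:RAX */
  return false;
}

/* Read by compare-exchange with expected = desired = 0. */
u128_model
model_read (u128_model *src)
{
  u128_model expected = { 0, 0 };
  u128_model zero = { 0, 0 };
  model_cmpxchg16b (src, &expected, zero);
  return expected;
}
```

If *src is zero, the comparison succeeds, zero is stored over zero, and `expected` is still zero. If *src is anything else, the comparison fails and `expected` receives the current value. Either way the caller gets the contents of *src and the stored value is unchanged.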

Dan · Answer

According to the instruction reference http://siyobik.info/main/reference/instruction/CMPXCHG8B%2FCMPXCHG16B, CMPXCHG16B is not atomic by default but can be made atomic with the LOCK prefix http://siyobik.info/main/reference/instruction/LOCK

That means that without LOCK, the data can be changed between the read and write phases of the instruction. With LOCK, the read and the write happen as a single atomic operation.
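In C, one way to get the LOCK-prefixed form without writing assembly is the C11 `<stdatomic.h>` compare-exchange, which GCC and Clang compile to a `lock cmpxchg` on x86-64. The sketch below shows the same "compare with zero, store zero" read trick at 8-byte width; this is an illustration only, since an 8-byte value can simply be read with `atomic_load`, and the 16-byte equivalent may need extra compiler flags or a support library. The function name `cas_read_u64` is made up for this example.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Reads *src by attempting to replace 0 with 0.
   On failure the current value is copied into expected;
   on success the value was 0 and stays 0.  */
uint64_t
cas_read_u64 (_Atomic uint64_t *src)
{
  uint64_t expected = 0;
  atomic_compare_exchange_strong (src, &expected, (uint64_t) 0);
  return expected;
}
```

Because the whole compare-exchange is one atomic read-modify-write, no other thread's update can land between its read and its write.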

Tags: c++, c, 64-bit, sse, lock-free

Eloff
