
Is it worse in any aspect to use the CMPXCHG instruction on an 8-bit field than on a 32-bit field?

I'd like to ask if using a CMPXCHG instruction on an 8-bit memory field would be worse in any aspect than using it on a 32-bit field.

I'm using C11 stdatomic.h to implement a couple of synchronization methods.

Dewr asked Oct 03 '19 07:10




1 Answer

No, there's no penalty for lock cmpxchg [mem], reg with an 8-bit operand vs. a 32-bit one. Modern x86 CPUs can load and store to their L1d cache with no penalty for a single byte vs. an aligned dword or qword. See "Can modern x86 hardware not store a single byte to memory?": it can, with zero penalty (footnote 1), because x86 CPUs spend the transistors to make even unaligned loads/stores fast.
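As a rough sketch of what that comparison looks like from C11, here is the same compare-and-swap written once for an 8-bit flag and once for a 32-bit flag. The helper names (try_acquire8, try_acquire32, demo_both_widths) are made up for illustration and are not from the question. On x86-64, GCC and Clang typically compile both helpers to a lock cmpxchg that differs only in operand size, but check your own compiler's asm output rather than taking that on faith.

```c
#include <assert.h>     /* static_assert (C11) */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: both widths are lock-free, as they are on mainstream x86-64. */
static_assert(ATOMIC_CHAR_LOCK_FREE == 2 && ATOMIC_INT_LOCK_FREE == 2,
              "expected lock-free 8-bit and 32-bit atomics");

/* Hypothetical helper: try to flip an 8-bit flag from 0 to 1. */
static inline bool try_acquire8(_Atomic uint8_t *flag)
{
    uint8_t expected = 0;
    return atomic_compare_exchange_strong_explicit(
        flag, &expected, (uint8_t)1,
        memory_order_acquire, memory_order_relaxed);
}

/* Same operation on a 32-bit flag; only the operand width changes. */
static inline bool try_acquire32(_Atomic uint32_t *flag)
{
    uint32_t expected = 0;
    return atomic_compare_exchange_strong_explicit(
        flag, &expected, (uint32_t)1,
        memory_order_acquire, memory_order_relaxed);
}

/* Usage sketch: the first attempt wins and the second fails, at either width. */
static inline bool demo_both_widths(void)
{
    _Atomic uint8_t  f8  = 0;
    _Atomic uint32_t f32 = 0;
    bool ok8  = try_acquire8(&f8)   && !try_acquire8(&f8);
    bool ok32 = try_acquire32(&f32) && !try_acquire32(&f32);
    return ok8 && ok32;
}
```

The practical difference is mostly footprint: the byte-sized flag lets you pack more independent flags into one cache line, which can increase contention between threads if those flags are hot.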

The surrounding asm instructions dealing with a narrow integer in a register should also have negligible if any extra cost vs. [u]int32_t. See "Why doesn't GCC use partial registers?": most compilers know how to be careful with partial registers, and modern CPUs (Haswell and later, and all non-Intel) don't rename the low 8 bits separately from the rest of the register, so the only danger is false dependencies. Depending on exactly what you're doing, it might be best to use unsigned local temporaries with an _Atomic uint8_t, or it might be best to make your locals also uint8_t.
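To make the "unsigned local temporaries" option concrete, here is a minimal sketch of a CAS retry loop on an _Atomic uint8_t where the arithmetic is done in a full-width unsigned local and narrowed only when handed back to the atomic operation. The saturating counter and the names sat_inc8 and demo_sat_inc8 are hypothetical, chosen only to give the loop something to do. Whether this beats uint8_t locals depends on the surrounding code, so treat it as one option to benchmark.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: saturating increment of an 8-bit atomic counter.
 * Returns the value the counter held before this call. */
static inline unsigned sat_inc8(_Atomic uint8_t *ctr)
{
    /* 'old' must be uint8_t because the CAS takes a pointer to it. */
    uint8_t old = atomic_load_explicit(ctr, memory_order_relaxed);

    for (;;) {
        unsigned next = old;        /* widen once; do the math full-width */
        if (next < UINT8_MAX)
            next++;                 /* saturate at 255 instead of wrapping */

        if (atomic_compare_exchange_weak_explicit(
                ctr, &old, (uint8_t)next,
                memory_order_relaxed, memory_order_relaxed))
            return old;             /* success: 'old' is the previous value */

        /* Failure: 'old' now holds the current value; retry with it. */
    }
}

/* Usage sketch: three increments starting from 254 stop at 255. */
static inline unsigned demo_sat_inc8(void)
{
    _Atomic uint8_t c = 254;
    sat_inc8(&c);
    sat_inc8(&c);
    sat_inc8(&c);
    return atomic_load_explicit(&c, memory_order_relaxed);
}
```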

Footnote 1: Unlike on some non-x86 CPUs, where a byte store actually is implemented with a cache RMW cycle (see "Are there any modern CPUs where a cached byte store is actually slower than a word store?"). On those CPUs you'd hope that an atomic xchg would be just as cheap for word vs. byte, but that's too much to hope for with cmpxchg. But almost all non-x86 ISAs have LL/SC instead of xchg / cmpxchg anyway, so even an atomic exchange is separate LL and SC instructions, and the SC would take an RMW cycle to commit to cache.

Peter Cordes answered Nov 02 '22 05:11