I'm reading the smphello.s code by Allan Cruse.
In the following part he's trying to set up a stack segment for each processor.
The point is that he uses xadd without a lock prefix, while the description of xadd (here) says a lock prefix may be used.
Is this a bug, or is it okay? And why?
# setup an exclusive stack-area for this processor
mov $0x1000, %ax # paragraphs in segment
xadd %ax, newSS # 'atomic' xchg-and-add
mov %ax, %ss # segment-address in SS
xor %esp, %esp # top-of-stack into ESP
xadd without lock is atomic with respect to interrupts on this core, but not with respect to code running on other cores (or DMA), just like every other memory-destination RMW except xchg. See this answer, which covers the same question for cmpxchg.
If this code runs simultaneously on multiple cores, two or more cores could read the same value of newSS, effectively losing an increment and assigning the same ss:esp to both cores. Or one core's store could be delayed across multiple xadd's by other cores that happen to run back-to-back, "rewinding" the counter to some previous value seen by later loads. Or any combination of these problems. See Can num++ be atomic for 'int num'?
Assuming newSS is aligned, the load and the store are each individually atomic, but together they do not form an atomic RMW.
If multiple cores are woken simultaneously (are broadcast IPIs possible?), it seems very possible for this to be a real problem. If not, it's likely that each xadd would finish before the next core reaches this code (including the store committing to L1d cache, i.e. becoming globally visible). But that's just "happens to work" behaviour, unless the core-wakeup routine waits to hear back from one core before waking another.
It definitely needs lock xadd if it wants to live up to the comment about an atomic increment. Being atomic only with respect to interrupts is fine if threads never run concurrently, only via context switch on a single core (e.g. atomicity between the main thread and a signal handler, or an interrupt handler on the same core). But since this file is titled smphello.s, uniprocessor assumptions seem unlikely.
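Assuming the rest of smphello.s is unchanged, the minimal fix is just the added prefix (lock is legal here because the destination operand of xadd is memory):

```
# setup an exclusive stack-area for this processor
mov  $0x1000, %ax       # paragraphs in segment
lock xadd %ax, newSS    # now a true atomic fetch-and-add
mov  %ax, %ss           # segment-address in SS
xor  %esp, %esp         # top-of-stack into ESP
```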
Just to provide some empirical evidence for these theoretical arguments:
Here is a test case where several threads use xadd to increment a shared counter. On an i7-8565U with 4 cores, it outputs
unlocked: counter = 1633267, expected 4000000
locked: counter = 4000000, expected 4000000
which clearly shows that xadd without lock is NOT atomic.
The code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <pthread.h>

unsigned long counter = 0;

#define COUNTS_PER_THREAD 1000000UL
#define THREADS 4

void *unlocked_worker(void *unused) {
    (void)unused;
    for (unsigned long i = 0; i < COUNTS_PER_THREAD; i++) {
        unsigned long inc = 1;
        asm volatile("xaddq %0, %1" : "+r" (inc), "+m" (counter));
    }
    return NULL;
}

void *locked_worker(void *unused) {
    (void)unused;
    for (unsigned long i = 0; i < COUNTS_PER_THREAD; i++) {
        unsigned long inc = 1;
        asm volatile("lock; xaddq %0, %1" : "+r" (inc), "+m" (counter));
    }
    return NULL;
}

void run_threads(int lock) {
    void *(*worker)(void *) = lock ? locked_worker : unlocked_worker;
    counter = 0;
    pthread_t th[THREADS];
    for (int i = 0; i < THREADS; i++) {
        int err = pthread_create(&th[i], NULL, worker, NULL);
        if (err != 0) {
            fprintf(stderr, "pthread_create: %s\n", strerror(err));
            exit(1);
        }
    }
    for (int i = 0; i < THREADS; i++) {
        int err = pthread_join(th[i], NULL);
        if (err != 0) {
            fprintf(stderr, "pthread_join: %s\n", strerror(err));
            exit(1);
        }
    }
    printf("%s: counter = %lu, expected %lu\n",
           lock ? "locked" : "unlocked",
           counter, COUNTS_PER_THREAD * THREADS);
}

int main(void) {
    run_threads(0);
    run_threads(1);
    return 0;
}
After giving it some more thought, another scenario for this case came to my mind.
The SDM describes xadd as: TEMP = SRC + DEST; SRC = DEST; DEST = TEMP. Suppose the internal implementation performs those steps with separate memory accesses:
temp = ax + newSS   ; read newSS, add
ax = newSS          ; old value back into ax
newSS = temp        ; write the sum back
Then we have a problem in this scenario:
Assume newSS is shared between two threads.
Thread t0 (with its ax equal to 5) loads newSS, adds its ax, and puts the result in a temp register.
Assume the other thread gets to run at exactly this point (on a single core an interrupt cannot split an instruction, so think of the two threads as running on two cores whose micro-steps interleave). Then t1 (with its ax also equal to 5) loads the same newSS, adds its ax, and puts the result in its temp register. Then t0 continues...
Both stack segment registers will point to the same address.
Obviously we have a problem here, unless the implementation touches memory only once, something like:
tmpRegister = ax
xchg ax, newSS
ax = ax + tmpRegister
(though note this leaves the sum in ax rather than in newSS, so it doesn't actually implement xadd's semantics). Either way: whenever newSS is read more than once, or read and written by separate memory accesses, we need the lock prefix.