Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Assembly: does xadd instruction need lock?

I'm reading smphello.s code by allan cruse code

in the following part he's trying to setup stack segment for each processor.

the point is that he used xadd without using lock prefix while in the description of xadd as in here . there may be a lock prefix.

is this a bug or is it okay ? and why ?

# setup an exclusive stack-area for this processor
mov  $0x1000, %ax   # paragraphs in segment
xadd %ax, newSS     # 'atomic' xchg-and-add
mov  %ax, %ss       # segment-address in SS
xor  %esp, %esp     # top-of-stack into ESP
like image 391
becks Avatar asked May 08 '15 18:05

becks


3 Answers

xadd without lock is atomic wrt. interrupts on this core, but not wrt. code running on other cores (or DMA). Just like all other memory-destination RMWs other than xchg. See this answer which covers the same question for cmpxchg.

If this code runs simultaneously on multiple cores, 2 or more cores could read the same value for newSS, effectively losing an increment and assigning the same ss:esp to both cores. Or one store could be delayed across multiple xadd's by other cores that happen to be sequential, "rewinding" the counter to some previous value seen by later loads. Or any combination of problems. Can num++ be atomic for 'int num'?

Assuming newSS is aligned, the load and store are separately atomic, but do not form an atomic RMW.

If multiple cores are woken simultaneously (are broadcast IPIs possible?) it seems very possible for this to be a real problem. If not, it's likely that each xadd would finish before the next core gets to this code. (Including the store committing to L1d cache; becoming globally visible.) But that's just a "happens to work" behaviour unless the core-wakeup function waits to hear back from one core before waking another.

It definitely needs lock xadd if it wants to match the comment about an atomic increment. Being atomic wrt. interrupts is fine if threads never run concurrently, only via context switch on a single core. (e.g. atomicity between the main thread and a signal handler, or interrupt handler on the same core). But since this is titled smphello.s, uniprocessor assumptions seem unlikely.

like image 174
Peter Cordes Avatar answered Nov 15 '22 16:11

Peter Cordes


Just to provide some empirical evidence for these theoretical arguments:

Here is a test case where several threads use xadd to increment a shared counter. On an i7-8565U with 4 cores, it outputs

unlocked: counter = 1633267, expected 4000000
locked: counter = 4000000, expected 4000000

which clearly shows that xadd without lock is NOT atomic.

The code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <pthread.h>

unsigned long counter = 0;

#define COUNTS_PER_THREAD 1000000UL
#define THREADS 4

void *unlocked_worker(void *unused) {
    (void)unused;
    for (unsigned long i = 0; i < COUNTS_PER_THREAD; i++) {
        unsigned long inc = 1;
        asm volatile("xaddq %0, %1" : "+r" (inc), "+m" (counter));
    }
    return NULL;
}

void *locked_worker(void *unused) {
    (void)unused;
    for (unsigned long i = 0; i < COUNTS_PER_THREAD; i++) {
        unsigned long inc = 1;
        asm volatile("lock; xaddq %0, %1" : "+r" (inc), "+m" (counter));
    }
    return NULL;
}

void run_threads(int lock) {
    void *(*worker)(void *) = lock ? locked_worker : unlocked_worker;
    counter = 0;
    pthread_t th[THREADS];
    for (int i = 0; i < THREADS; i++) {
        int err = pthread_create(&th[i], NULL, worker, NULL);
        if (err != 0) {
            fprintf(stderr, "pthread_create: %s\n", strerror(err));
            exit(1);
        }
    }
    for (int i = 0; i < THREADS; i++) {
        int err = pthread_join(th[i], NULL);
        if (err != 0) {
            fprintf(stderr, "pthread_join: %s\n", strerror(err));
            exit(1);
        }
    }
    printf("%s: counter = %lu, expected %lu\n",
           lock ? "locked" : "unlocked",
           counter, COUNTS_PER_THREAD * THREADS);
}

int main(void) {
    run_threads(0);
    run_threads(1);
    return 0;
}
like image 8
Nate Eldredge Avatar answered Nov 15 '22 14:11

Nate Eldredge


After another thought to it, another scenario for this case came to my mind.

if the microcode implementation of xadd be like this:

temp = ax + newSS
newSS = ax 
ax = temp ; the last 2 are actual xchg

then we have problem in this scenario:

Assume that newSS is shared between 2 threads.

Thread No.0 (t0 with it's ax equals to 5) loads and adds newSS with ax and put it into a temp register.

Assume that At this point we have a context switch. Then t1 with it's ax equals to 5 tries to load newSS and add it to ax and put the result in the temp register. and then a context switch back to t0... Both stack segment registers will point to the same address.

Obviously we have a problem here. Unless the microcode implementation be like this:

tmp register = ax
xchg ax, newSS
ax = ax + tmpRegister

in any other way that variable newSS is read more than once or read and written in different instructions, we need lock.

like image 3
AmirSojoodi Avatar answered Nov 15 '22 15:11

AmirSojoodi