How is thread synchronization implemented, at the assembly language level?

While I'm familiar with concurrent programming concepts such as mutexes and semaphores, I have never understood how they are implemented at the assembly language level.

I imagine there being a set of memory "flags" saying:

- lock A is held by thread 1
- lock B is held by thread 3
- lock C is not held by any thread
- etc.

But how is access to these flags synchronized between threads? Something like this naive example would only create a race condition:

        mov edx, [myThreadId]
    wait:
        cmp [lock], 0
        jne wait
        mov [lock], edx
        ; I wanted an exclusive lock but the above
        ; three instructions are not an atomic operation :(

asked Mar 03 '10 01:03

Martin

2 Answers

- In practice, these tend to be implemented with CAS (compare-and-swap) or LL/SC (load-linked/store-conditional), plus some spinning before giving up the thread's time slice, usually by calling into a kernel function that switches context.
- If you only need a spinlock, Wikipedia gives an example that trades CAS for a lock-prefixed `xchg` on x86/x64 (the prefix is redundant there, because `xchg` with a memory operand is always locked). So in a strict sense, CAS is not needed to build a spinlock, but some kind of atomicity is still required. In this case, it relies on an atomic operation that writes a register to memory and returns the previous contents of that memory slot in a single step. (To clarify a bit more: the lock prefix asserts the LOCK# signal, which ensures that the current CPU has exclusive access to the memory. On today's CPUs it is not necessarily carried out this way, but the effect is the same. By using `xchg` we make sure that we will not get preempted somewhere between reading and writing, since instructions are not interrupted half-way. So if we had an imaginary `lock mov reg0, mem` / `lock mov mem, reg1` pair (which we don't), that would not be quite the same: it could be preempted between the two movs.)
- On current architectures, as pointed out in the comments, you mostly end up using the atomic primitives of the CPU and the coherency protocols provided by the memory subsystem.
- For this reason, you not only have to use these primitives, but also have to account for the cache/memory coherency guaranteed by the architecture.
- There may be implementation nuances as well. Consider, for example, a spinlock:
  - instead of a naive implementation, you should probably use something like a TTAS (test-and-test-and-set) spinlock with some exponential backoff,
  - on a Hyper-Threaded CPU, you should probably issue `pause` instructions, which hint that you are spinning, so that the core you are running on can do something useful in the meantime,
  - you should really give up on spinning and yield control to other threads after a while,
  - etc.
- All of this is still user mode. If you are writing a kernel, you may have other tools available as well (since you are the one who schedules threads and handles/enables/disables interrupts).
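
To make the spinlock nuances above a bit more concrete, here is a minimal sketch of a TTAS lock with exponential backoff, a spin hint and an eventual yield. It is written in C11 rather than assembly; on x86, compilers typically turn the atomic exchange into `xchg` and the hint into `pause`. The names `ttas_lock`, `cpu_relax` and the backoff cap of 1024 are my own assumptions for illustration, not anything from a particular library.

```c
#include <assert.h>
#include <sched.h>       /* sched_yield (POSIX) */
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for this sketch: 0 = free, 1 = held. */
typedef struct { atomic_int state; } ttas_lock;

/* Spin-wait hint: the x86 `pause` instruction where available, else nothing. */
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

/* One attempt to take the lock; returns true on success. */
static inline bool ttas_try_lock(ttas_lock *l) {
    /* "Test": a plain load first, so waiters do not hammer the bus with writes. */
    if (atomic_load_explicit(&l->state, memory_order_relaxed) != 0)
        return false;
    /* "Test-and-set": atomically store 1 and get the previous value back. */
    return atomic_exchange_explicit(&l->state, 1, memory_order_acquire) == 0;
}

static inline void ttas_lock_acquire(ttas_lock *l) {
    unsigned backoff = 1;                 /* spin hints issued per failed attempt */
    while (!ttas_try_lock(l)) {
        for (unsigned i = 0; i < backoff; i++)
            cpu_relax();
        if (backoff < 1024)               /* assumed cap, tune for real workloads */
            backoff *= 2;                 /* exponential backoff */
        else
            sched_yield();                /* stop burning CPU: give up the time slice */
    }
}

static inline void ttas_lock_release(ttas_lock *l) {
    /* Releasing a lock that is not held is a bug in the caller. */
    assert(atomic_load_explicit(&l->state, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

A thread would wrap its critical section in `ttas_lock_acquire(&l)` / `ttas_lock_release(&l)`, with `l` initialised as `ttas_lock l = { 0 };`. A real mutex would normally block in the kernel (for example via a futex on Linux) instead of calling `sched_yield` in a loop.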

answered Oct 09 '22 00:10

Andras Vass

The x86 architecture has long had an instruction called xchg, which exchanges the contents of a register with a memory location. xchg has always been atomic.

There has also always been a lock prefix that can be applied to a single instruction (only certain instructions accept it) to make that instruction atomic. Before there were multiprocessor systems, all this really did was prevent an interrupt from being delivered in the middle of a locked instruction. (xchg is implicitly locked.)

The Wikipedia article on spinlocks has sample code that uses xchg to implement one: http://en.wikipedia.org/wiki/Spinlock

When multi-CPU and later multi-core systems began to be built, more sophisticated mechanisms were needed to ensure that lock and xchg would synchronize all of the memory subsystems, including the L1 caches on all of the processors. Around the same time, research into locking and lock-free algorithms showed that an atomic compare-and-set (also called compare-and-swap, CAS) was a more flexible primitive, so more modern CPUs provide it as an instruction (cmpxchg on x86).
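
As a rough illustration of why compare-and-set is the more flexible primitive, it can do exactly what the question's naive loop tried to do: check that the lock word is 0 and store the thread ID, as one atomic step. The sketch below uses C11 atomics rather than assembly; on x86, compilers typically emit `lock cmpxchg` for the compare-exchange call. The `owner_lock` type and the function names are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 means free, otherwise the owner's thread ID. */
typedef struct { atomic_int owner; } owner_lock;

/* One attempt: atomically perform "if (owner == 0) owner = my_id". */
static inline bool owner_try_lock(owner_lock *l, int my_id) {
    int expected = 0;   /* the CAS only succeeds if the lock is currently free */
    return atomic_compare_exchange_strong_explicit(
        &l->owner, &expected, my_id,
        memory_order_acquire, memory_order_relaxed);
}

/* Naive spin until the CAS succeeds (no backoff or yielding in this sketch). */
static inline void owner_lock_acquire(owner_lock *l, int my_id) {
    while (!owner_try_lock(l, my_id))
        ;
}

static inline void owner_unlock(owner_lock *l, int my_id) {
    /* Only the thread recorded as owner may release the lock. */
    assert(atomic_load_explicit(&l->owner, memory_order_relaxed) == my_id);
    atomic_store_explicit(&l->owner, 0, memory_order_release);
}
```

Unlike a plain exchange, the CAS leaves the lock word untouched when the lock is already held, so the word always names the real owner. Thread IDs are assumed to be nonzero here, since 0 is reserved for "free".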

Addendum: In the comments, Andras supplied a "dusty old" list of instructions that allow the lock prefix: http://pdos.csail.mit.edu/6.828/2007/readings/i386/LOCK.htm

answered Oct 09 '22 00:10

John Knoeller
