I found the following spinlock code in boost::smart_ptr
:
bool try_lock()
{
return (__sync_lock_test_and_set(&v_, 1) == 0);
}
void lock()
{
for (unsigned k=0; !try_lock(); ++k)
{
if (k<4)
; // spin
else if (k < 16)
__asm__ __volatile__("pause"); // was ("rep; nop" ::: "memory")
else if (k < 32 || k & 1)
sched_yield();
else
{
struct timespec rqtp;
rqtp.tv_sec = 0;
rqtp.tv_nsec = 100;
nanosleep(&rqtp, 0);
}
}
}
void unlock()
{
__sync_lock_release(&v_);
}
So if I understand this correctly, when the lock is contended the incoming thread will exponentially back-off, first spinning wildly, then pausing, then yielding the remainder of its time slice, and finally flip-flopping between sleeping and yielding.
I also found the glibc pthread_spinlock
implementation, which uses assembly to perform the lock.
#define LOCK_PREFIX "lock;" // using an SMP machine
int pthread_spin_lock(pthread_spinlock_t *lock)
{
__asm__ ("\n"
"1:\t" LOCK_PREFIX "decl %0\n\t"
"jne 2f\n\t"
".subsection 1\n\t"
".align 16\n"
"2:\trep; nop\n\t"
"cmpl $0, %0\n\t"
"jg 1b\n\t"
"jmp 2b\n\t"
".previous"
: "=m" (*lock)
: "m" (*lock));
return 0;
}
I will admit that my understanding of assembly is not great, so I don't fully understand what is happening here. (Could someone please explain what this is doing?)
However, I ran some tests against the boost spinlock and glibc pthread_spinlock, and when there are more cores than threads, the boost code outperforms the glibc code.
On the other hand, When there are more threads than cores, the glibc code is better.
Why is this? What is the difference between these two spinlock implementations that makes them perform differently in each scenario?
Where did you get the pthread_spin_lock()
implementation posted in the question? It seems to be missing a couple important lines.
The implementation I see (which isn't inline assembly - it's a stand-alone assembly source file from glibc/nptl/sysdeps/i386/pthread_spin_lock.S
) looks similar, but has the two additional critical instructions:
#include <lowlevellock.h>
.globl pthread_spin_lock
.type pthread_spin_lock,@function
.align 16
pthread_spin_lock:
mov 4(%esp), %eax
1: LOCK
decl 0(%eax)
jne 2f
xor %eax, %eax
ret
.align 16
2: rep
nop
cmpl $0, 0(%eax)
jg 1b
jmp 2b
.size pthread_spin_lock,.-pthread_spin_lock
It decrements a long
pointed to by the passed in parameter and returns if the result is zero.
Otherwise, the result was non-zero which means this thread didn't acquire the lock. So it performs a rep nop
, which is equivalent to the pause
instruction. This is a 'special' nop that gives a hint to the CPU that the thread is in a spin, and the cpu should handle memory ordering and/or branch prdiction in some manner that improves performance in these situations (I don't pretend to understand exactly what happens differently under the chip's covers - from the software's point of view, there's no difference from a plain old nop
).
After the pause
it checks the value again - if it's greater than zero, the lock is unclaimed so it jumps to the top of the function and tries to claim the lock again. Otherwise, it jumps to the pause
again.
The main difference between this spinlock and the Boost version is that this one never does anything fancier than a pause
when it's spinning - there's nothing like a sched_yield()
or nanosleep()
. So the thread stays hot. I'm not sure precisely how this plays in the two behaviors you noted, but the glibc code will be greedier - if a thread is spinning on the lock and there are other threads ready to run but no available core, the spinning thread doesn't help the waiting thread get any cpu time, while the Boost version will eventually voluntarily make way for the thread that's waiting for some attention.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With