I have two versions of a spinlock below. The first uses the default ordering, memory_order_seq_cst, while the second uses memory_order_acquire/memory_order_release. Since the latter is more relaxed, I expected it to perform better, but that doesn't seem to be the case.
#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

void DoWaitBackoff(int& backoff);  // backoff helper, sketched below

class SimpleSpinLock
{
public:
    inline void lock()
    {
        int backoff = 0;
        // test_and_set returns the previous value, so false means
        // the flag was clear and we now hold the lock.
        while (mFlag.test_and_set()) { DoWaitBackoff(backoff); }
    }

    inline void unlock()
    {
        mFlag.clear();
    }

private:
    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};
class SimpleSpinLock2
{
public:
    inline void lock()
    {
        int backoff = 0;
        // memory_order_acquire: a successful acquisition synchronizes
        // with the releasing clear() of the previous lock holder.
        while (mFlag.test_and_set(std::memory_order_acquire)) { DoWaitBackoff(backoff); }
    }

    inline void unlock()
    {
        // memory_order_release: makes the critical section's writes
        // visible to the next thread that acquires the lock.
        mFlag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};
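DoWaitBackoff is not shown above; for completeness, here is a sketch of one plausible exponential-backoff implementation (the PAUSE-based spin and the cap are assumptions, not the original code; assumes x86-64):

#include <immintrin.h>  // _mm_pause

inline void DoWaitBackoff(int& backoff)
{
    // Spin for 2^backoff PAUSE instructions, then widen the window.
    // PAUSE is the x86 spin-wait hint; the compiler cannot remove it.
    for (int i = 0; i < (1 << backoff); ++i)
        _mm_pause();
    if (backoff < 10)  // cap the wait at 2^10 iterations (arbitrary)
        ++backoff;
}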
const int NUM_THREADS = 8;
const int NUM_ITERS = 5000000;
const int EXPECTED_VAL = NUM_THREADS * NUM_ITERS;

int val = 0;
long j = 0;

SimpleSpinLock spinLock;

void ThreadBody()
{
    for (int i = 0; i < NUM_ITERS; ++i)
    {
        spinLock.lock();
        ++val;
        j = i * 3.5 + val;  // extra work so the critical section is not just one increment
        spinLock.unlock();
    }
}
int main()
{
    vector<thread> threads;

    for (int i = 0; i < NUM_THREADS; ++i)
    {
        cout << "Creating thread " << i << endl;
        threads.emplace_back(ThreadBody);
    }

    for (thread& thr : threads)
    {
        thr.join();
    }

    cout << "Final value: " << val << "\t" << j << endl;
    assert(val == EXPECTED_VAL);

    return 0;
}
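Side note: both classes provide lock()/unlock() and therefore satisfy the BasicLockable requirements, so the manual lock/unlock pair in ThreadBody could equally be written with std::lock_guard. A minimal sketch (ThreadBodyRAII is just an illustrative name):

#include <mutex>  // std::lock_guard

void ThreadBodyRAII()
{
    for (int i = 0; i < NUM_ITERS; ++i)
    {
        // lock() on construction, unlock() automatically on scope exit.
        std::lock_guard<SimpleSpinLock> guard(spinLock);
        ++val;
        j = i * 3.5 + val;
    }
}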
I am running on Ubuntu 12.04 with gcc 4.8.2 at optimization level -O3.
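For anyone reproducing the numbers, a build command along these lines should work (the exact command is a reconstruction; only -O3 is stated above, and the file name is a placeholder):

g++ -std=c++11 -O3 -pthread spinlock.cpp -o spinlock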
-- Spinlock with memory_order_seq_cst:
Run 1:
real 0m1.588s
user 0m4.548s
sys 0m0.052s
Run 2:
real 0m1.577s
user 0m4.580s
sys 0m0.032s
Run 3:
real 0m1.560s
user 0m4.436s
sys 0m0.032s
-- Spinlock with memory_order_acquire/release:
Run 1:
real 0m1.797s
user 0m4.608s
sys 0m0.100s
Run 2:
real 0m1.853s
user 0m4.692s
sys 0m0.164s
Run 3:
real 0m1.784s
user 0m4.552s
sys 0m0.124s
Run 4:
real 0m1.475s
user 0m3.596s
sys 0m0.120s
With the more relaxed ordering I see a lot more variability. Sometimes it's better, but often it is worse. Does anyone have an explanation for this?
The generated unlock code is different. The seq_cst memory order (with g++ 4.9.0) generates:
movb %sil, spinLock(%rip)
mfence
for the unlock. The acquire/release generates:
movb %sil, spinLock(%rip)
The lock code is the same. Someone else will have to say something definitive about why it's better with the fence, but if I had to guess, I would guess that it reduces bus/cache-coherence contention, possibly by reducing interference on the bus. Sometimes stricter is more orderly, and thus faster.
ADDENDUM: According to this, mfence costs around 100 cycles. So maybe you are reducing bus contention, because when a thread finishes the loop body, it pauses a bit before trying to reacquire the lock, letting the other thread finish. You could try to do the same thing by putting in a short delay loop after the unlock, though you'd have to make sure that it didn't get optimized out.
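For instance, a delay that survives -O3 can be built around a volatile sink (a sketch; the helper name ShortDelay and the iteration count are made up):

inline void ShortDelay(int iters)
{
    // The volatile store forces the compiler to keep every iteration,
    // so the delay cannot be optimized out.
    volatile long sink = 0;
    for (int i = 0; i < iters; ++i)
        sink = sink + i;
}

// e.g.:  spinLock.unlock(); ShortDelay(50);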
ADDENDUM2: It does appear to be bus interference/contention from looping around too quickly. I added a short delay loop after the unlock:
spinLock.unlock();
// The write to the global j keeps this loop from being removed outright.
for (int i = 0; i < 5; i++) {
    j = i * 3.5 + val;
}
Now the acquire/release version performs the same as the seq_cst one.