I came across this SO question, and reading it over eventually led me to look at boost::detail::spinlock_pool.

The purpose of boost::detail::spinlock_pool is to reduce potential contention for a global spinlock by choosing from an array of spinlocks, hashing over the shared_ptr's address. This seems like a reasonable solution, but there appears to be a problem with the current (Boost v1.49) version's implementation.

spinlock_pool manages a statically allocated array of 41 spinlock instances. It appears that sizeof(spinlock)==4 for the platforms I looked at -- which means that on, say, x64 with 64-byte cache lines, there will be 16 spinlocks per cache line. I.e. the whole array spans all of 2 1/2 cache lines, so there's roughly a 40% chance of one random spinlock false sharing with another... which almost completely defeats the purpose of the pool in the first place.
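For reference, the lock-selection logic (paraphrased from memory from boost/smart_ptr/detail/spinlock_pool.hpp, so treat it as a sketch rather than a verbatim copy) is roughly:

template< int I > class spinlock_pool
{
private:
    static spinlock pool_[ 41 ];   // 41 contiguous spinlocks, no padding between them
public:
    static spinlock & spinlock_for( void const * pv )
    {
        // hash the shared_ptr's address into the 41-entry array
        std::size_t i = reinterpret_cast< std::size_t >( pv ) % 41;
        return pool_[ i ];
    }
};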
Is my analysis correct or am I missing something important?
UPDATE: I did finally write a small benchmark program:
#include <boost/shared_ptr.hpp>
#include <boost/smart_ptr/detail/spinlock.hpp>  // boost::detail::spinlock
#include <boost/thread.hpp>
#include <boost/timer.hpp>
#include <iostream>
#include <vector>
#include <cstring>   // memset
#include <stdlib.h>
#include <tchar.h>   // _tmain / _TCHAR (MSVC; Unicode build assumed, hence wcstoul below)

using namespace std;
using boost::detail::spinlock;

// Rebuild with SLsPerCacheLine = 8 to spread the used locks across cache lines.
enum { BufferSize = 1<<24, SLsPerCacheLine = 1 };

int ibuffer[BufferSize];

size_t nslp = 41;        // spinlock table size; overridable from the command line
spinlock* pslp = 0;

spinlock& getSpinlock(size_t h)
{
    return pslp[ (h % nslp) * SLsPerCacheLine ];
}

void threadFunc(int offset)
{
    const size_t mask = BufferSize - 1;
    for (size_t ii = 0, index = (offset & mask); ii < BufferSize; ++ii, index = ((index + 1) & mask))
    {
        spinlock& sl = getSpinlock(index);
        sl.lock();
        ibuffer[index] += 1;
        sl.unlock();
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    if (argc > 1)
    {
        size_t n = wcstoul(argv[1], NULL, 10);
        if (n > 0)
        {
            nslp = n;
        }
    }

    cout << "Using pool size: " << nslp << endl;
    cout << "sizeof(spinlock): " << sizeof(spinlock) << endl;
    cout << "SLsPerCacheLine: " << int(SLsPerCacheLine) << endl;

    const size_t num = nslp * SLsPerCacheLine;
    pslp = new spinlock[num];
    // boost::detail::spinlock has no constructor; zero-filling puts each lock in its
    // unlocked state (equivalent to BOOST_DETAIL_SPINLOCK_INIT on the lightweight implementations).
    memset(pslp, 0, num * sizeof(*pslp));

    const size_t nThreads = 4;
    boost::thread* ppThreads[nThreads];
    const int offset[nThreads] = { 17, 101, 229, 1023 };

    boost::timer timer;

    for (size_t ii = 0; ii < nThreads; ii++)
    { ppThreads[ii] = new boost::thread(threadFunc, offset[ii]); }

    for (size_t ii = 0; ii < nThreads; ii++)
    { ppThreads[ii]->join(); }

    cout << "Elapsed time: " << timer.elapsed() << endl;

    for (size_t ii = 0; ii < nThreads; ii++)
    { delete ppThreads[ii]; }

    delete[] pslp;

    return 0;
}
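To reproduce the runs described below: the spinlock table size is passed as the only command-line argument and defaults to 41 if omitted (e.g. bench.exe 41, where the executable name is just a placeholder), and the SLsPerCacheLine==1 versus ==8 variants are produced by recompiling with the enum changed.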
I compiled two versions of the code, one with SLsPerCacheLine==1 and one with SLsPerCacheLine==8: 32-bit optimized builds using MSVS 2010, run on a 4-core Xeon W3520 @ 2.67 GHz (HyperThreading disabled).

I had trouble getting consistent results out of these tests -- spurious timing variations of up to 50% were occasionally observed. On average, however, the SLsPerCacheLine==8 version was ~25-30% faster than the SLsPerCacheLine==1 version with a spinlock table of size 41.
It would be interesting to see how this scales with larger numbers of cores, NUMA, HyperThreading, etc. I don't presently have access to that sort of hardware.
False sharing occurs when threads on different processors modify variables that reside on the same cache line.
In general, false sharing can be reduced using the following techniques: make use of private or threadprivate data as much as possible; use the compiler's optimization features to eliminate memory loads and stores; pad data structures so that each thread's data resides on a different cache line.
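As a sketch of the padding technique applied to these locks (a hypothetical wrapper type, assuming a 64-byte cache line and sizeof(spinlock) < 64, as observed above):

#include <boost/smart_ptr/detail/spinlock.hpp>

// Hypothetical wrapper: each lock occupies its own 64-byte cache line,
// so threads spinning on different locks never invalidate each other's line.
struct padded_spinlock
{
    boost::detail::spinlock sl;
    char pad[ 64 - sizeof(boost::detail::spinlock) ];
};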
BRIEF
You are correct in that the spinlocks are sharing cache lines when they could be in different cache lines, a.k.a. false sharing, and there may well be some performance gain to be had by allocating the locks in different cache lines (but see below).
However, this is NOT the same as lock contention. Lock contention arises when a lock is held by one guy, and one or more other guys want to access it.
I.e. spinlock_pool has cache line contention, caused by locks being co-resident in the same cache line. But it has reduced software lock contention compared to a single global lock.
The reduced software lock contention is almost undoubtedly good.
The cache line contention probably hurts a bit, as your benchmarks show to some degree, but it is a second-order effect compared to the software lock contention.
DETAIL
Background
First, some background:
Classic spinloops are test-and-test-and-set:
loop:
tmp := load( memory_location_of_sw_lock )
if( is_locked(tmp) ) goto loop
was_locked := atomic_hw_locked_rmw( memory_location_of_sw_lock,
value_that_will_get_sw_lock )
if( is_locked(was_locked) ) goto loop
got_the_sw_lock:
...
... critical section
...
// release the lock, e.g.
memory_location_of_sw_lock := 0
There are also test-and-set spinlocks, which look like
loop:
was_locked := atomic_hw_locked_rmw( memory_location_of_sw_lock,
value_that_will_get_sw_lock )
if( is_locked(was_locked) ) goto loop
These can have very bad performance on most modern memory systems with caches, write-through or write-back. (Although some of the hardware optimizations that I have proposed make test-and-set spinloops as fast as test-and-test-and-set spinloops - slightly faster because they are smaller.)
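For concreteness, here is a minimal C++11 sketch of the test-and-test-and-set pattern above, using std::atomic; it is an illustration of the idea, not Boost's actual spinlock code:

#include <atomic>

struct ttas_spinlock
{
    std::atomic<int> state;   // 0 = unlocked, 1 = locked

    ttas_spinlock() : state(0) {}

    void lock()
    {
        for (;;)
        {
            // "test" part: plain loads only, no bus-locked traffic while the lock is held
            while (state.load(std::memory_order_relaxed) != 0)
                { /* spin */ }
            // "test-and-set" part: one atomic RMW; retry if another thread won the race
            if (state.exchange(1, std::memory_order_acquire) == 0)
                return;
        }
    }

    void unlock()
    {
        state.store(0, std::memory_order_release);
    }
};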
Note that there are two different concepts of lock here: the "software" lock that the spinlock is acquiring, and the hardware lock that is used by the atomic_hw_locked_rmw instruction, such as Intel LOCK INC mem or CMPXCHG. We don't care about the latter much, except to know that it usually unconditionally writes to the cache line holding the software lock, invalidating all other copies of the cache line. (Making the write conditional is another possible hardware optimization.)
O(N^2) burst of cache misses on (software) lock contention
Lock contention with test-and-test-and-set spinloops is particularly bad. The waiters all spin on the lock, and when it is released there is a burst of bus accesses. One guy wins, the others realize that they have lost, and eventually they settle down to spin again. This burst of activity is particularly bad because for N waiting guys (threads/processes/processors) the burst of bus activity can be O(N^2) in size: in the worst case, everyone exits the test- part of a test-and-test-and-set spinloop and everyone tries to do the atomic locked RMW (read-modify-write) instruction, like x86 LOCK INC mem or CMPXCHG, at the same time. This means that everyone will eventually write the line, even though all but the first didn't need to write it, because they won't get the lock.
E.g.
Lock is held by P0
P1-PN are spinning in test-and-test-and-set spinloops waiting for it.
P0 releases the lock, e.g. by writing it to 0
P1's "test-" instruction reads the lock
...
PN's "test-" instruction reads the lock
All of P1-PN see the lock as released,
so they fall out of the "test-" part of the spinloop which uses ordinary instructions
to the test-and-set part of the spinloop, which uses a hardware atomic RMW like LOCK INC mem
P1's locked RMW happens, and acquires the spinlock for P1
It invalidates all the other copies of that cache line
P1 goes away and does something with the data protected by the lock
P2's locked RMW fails to acquire the spinlock
It invalidates all the other copies of the cache line, because it is a write
P2 falls back into its test- spinloop
P3's locked RMW fails
It invalidates all the other copies of the cache line, because it is a write
P3 falls back into its test- spinloop
...
PN's locked RMW fails
and now, at the very least, all of the remaining P2..PN processors must do ordinary unlocked cache misses for their test- spinloop. This implies at least N+(N-1) cache misses. It can be considerably worse, because it is possible for each write, by a waiter who will fail to acquire the lock, to trigger all of the other waiters to do an unlocked read. I.e., depending on timing, you may get
1 release the lock
N reads to test- the lock
1 locked atomic RMW to acquire the lock
N reads...
for each of the remaining N-1 processors
1 locked RMW that fails to acquire the lock
N reads
which is O(N^2). Or possibly, for processor M, 1 locked RMW, followed by 2..M reads -- which is still O(N^2).
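To make that concrete, plugging a small N into the counts listed above: with N = 8 waiters, the worst case is 1 release write + 8 reads + 1 winning RMW + 8 reads + 7 * (1 failing RMW + 8 reads) = (8+1)^2 = 81 bus transactions, versus roughly N + 2 = 10 if only the eventual winner ever wrote the line.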
What does this mean for this question?
OK: if there is true lock contention, you get this O(N^2) burst of cache misses when a contended lock is released.
However, the spinlock_pool spreads the waiters out across multiple locks. If there are S locks in the spinlock pool, you get far fewer waiters: probably fewer than N/S (because contention tends to be superlinear in the number of people sharing the same lock).
I.e. with Boost's spinlock_pool, you might naively expect to get 1/41st the amount of contention -- and probably less than that.
Remember, spinlock_pool is a compromise between having a lock per shared_pointer, which increases the size of the shared pointers, and having all of the shared_pointers share the same lock. So any contention on a shared_pointer's spinlock may be (a) true contention, or (b) false contention caused by independent shared_pointers hashing to the same entry in the spinlock_pool.
Now, yes, if instead of having N waiters you have "only" N/41 waiters, then the burst is still O((N/41)^2), which is still O(N^2) asymptotically. But if N is typically less than 41... you get the idea.
Basically, hashing to spread the shared_pointers out over multiple spinlock_pool entries quickly reduces the amount of contention.
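As a rough worked example of that scaling: with N = 82 waiting threads, a single global lock would see a release burst on the order of 82^2 (roughly 6700 transactions), while spreading them evenly over the 41 pool entries leaves about 2 waiters per lock and a per-release burst on the order of 2^2 = 4 -- roughly a 41^2 reduction per release.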
But... the spinlocks live in the same cache line? Right... but waiters on the other locks in that cache line will not advance to the test-and-set part of their spinloop, because their own lock is still held.
I.e. if a contended lock with N waiters shares a cache line with M other spinning processes, you will get M*N traffic. But if the hashing has reduced M to O(1), you only get N traffic.
And if most of the time there are no other waiters at all, then you get only O(1) traffic on a lock release.
BOTTOM LINE:
Reducing software lock contention is a much bigger benefit performance-wise than reducing the cache false sharing that causes hardware cache line contention.
Now, there still may be benefit to be had in not packing so many spinlock_pool entries into the same cache line. But this is by no means obvious; it is an empirical question, by which I mean that you need to run an experiment, and it will vary with workload.
Sometimes false sharing of these locks in the same cache line will be bad. A classic example is the locks that protect processor run queues.
Sometimes false sharing of these locks in the same cache line is good - it can provide the same benefits that prefetch does in a cache. E.g. imagine that you have allocated an array of shared_pointers, and that people usually access aosptr[i] and then aosptr[i+1], etc.
It all depends on your workload. I've seen it fall both ways. Usually in my experience one lock per cache line is better, although often not by much.
MORE FUN
In case you care, hardware optimization for test-and-test-and-set spinloops was the subject of my MS thesis, "A Feature Taxonomy of Synchronization Primitives, including the Bus Abandonment Lock" - a hardware update cache protocol that eliminated the burst of bus accesses. Not published in any formal publication, except for University Microfiche.
My work was inspired by the paper "The performance of spin lock alternatives for shared-memory multiprocessors" by T. Anderson, IEEE Transactions on Parallel and Distributed Systems, 1990.
I think it's fair to say that most publications have been on software, algorithms, including the famous MCS lock. I think it is also fair to say that such techniques, while popular in theory, have been little used by journeymen programmers.
By the way, there are some more hardware optimizations possible here. E.g. CMPXCHG need not write the lock at all. Unfortunately, on current Intel hardware, or at least on the Intel hardware I designed in 1991, which I still think is used, the only way to release the hardware lock (that is used to implement the atomic locked RMW) is to use a special microcode write "store-unlock". Heck, the microcode could even use LoadLinked/StoreConditional (LL/SC), falling back to locking in those situations where a naive LL/SC makes no forward progress on some threads.
It is possible that Intel has fixed this recently; I left Intel in 2009. I had been trying to get it fixed, improved, optimized, oh, since 1991. And Intel has been improving the performance of locks a lot recently. But I think they have mainly been working on uncontended lock performance, and have not been optimizing contended lock performance.
Similarly, Alain Kagi, in his dissertation and in some publications, as well as patent http://www.google.com/patents/US6460124, shows that by adding judicious delays the cache hardware can make spinlocks as efficient as queued locks.
Some of these hardware optimizations make test-and-set spinloops better than test-and-test-and-set spinloops.
The most recent development has been Intel TSX (Transactional Synchronization Extensions), consisting of HLE (Hardware Lock Elision (Ravi Rajwar worked on this at UWisc while I and Alain Kagi were there, although my work in synchronization was earlier at UIUC)) and RTM (Restricted Transactional Memory). Both of these help uncontended .. hmm, they help with contended locks that are falsely contending, e.g. a coarse grain lock protecting what could be independent stuff. I.e. false lock contention. In some ways, HLE may make spinlock_pool unnecessary.
APOLOGY
I am sorry: I have provided a long-winded answer that boils down to "reducing software lock contention can be much more important than hardware cache line contention for spinloops".
While I would expect some performance to be gained from allocating 1 spinloop per cache line, it may be small, and on some not-exactly-uncommon workloads there may even be a loss.
And to be sure, you would have to measure it.
But, in any case, the big gain comes from reducing software lock contention.