In general, most widely known implementations of reference-counting smart pointer classes in C++, including the standard std::shared_ptr, use atomic reference counting but do not provide atomic access to the same smart pointer instance. In other words, multiple threads may safely operate on separate shared_ptr instances that point to the same shared object, but multiple threads cannot safely read/write the same shared_ptr instance without providing some kind of external synchronization, such as a mutex.
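For example, here is a minimal sketch of the distinction (the mutex is just one way to provide that synchronization; std::atomic_load/std::atomic_store on shared_ptr, or C++20's std::atomic<std::shared_ptr>, are others):

#include <memory>
#include <mutex>
#include <thread>

std::shared_ptr<int> gp = std::make_shared<int>(42);
std::mutex gp_mutex;

// Safe: each thread works on its own shared_ptr instance. Only the
// shared reference count is touched concurrently, and that update
// is atomic.
void worker(std::shared_ptr<int> my_copy)
{
    std::shared_ptr<int> another = my_copy; // copies made from our own
                                            // instance are fine
    // ... use *another ...
}

// Not safe without synchronization: reading and writing the *same*
// shared_ptr instance (gp) from several threads is a data race, so a
// mutex guards it here.
void replace_global()
{
    std::lock_guard<std::mutex> lock(gp_mutex);
    gp = std::make_shared<int>(7);
}

int main()
{
    std::thread t1(worker, gp); // gp is copied here, before t2 starts
    std::thread t2(replace_global);
    t1.join();
    t2.join();
}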
An atomic version of shared_ptr, called "atomic_shared_ptr", has been proposed, and preliminary implementations already exist. Presumably, atomic_shared_ptr could easily be implemented with a spin lock or mutex, but a lock-free implementation is also possible.
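A minimal sketch of the lock-based route, using a spin lock (the class name and tiny interface below are my own illustration, not the proposed atomic_shared_ptr API):

#include <atomic>
#include <memory>
#include <utility>

// A lock-based stand-in for an atomic shared pointer: every operation
// takes a small spin lock, so it is thread-safe but not lock-free.
template <typename T>
class locked_shared_ptr {
public:
    locked_shared_ptr() = default;
    explicit locked_shared_ptr(std::shared_ptr<T> p) : ptr_(std::move(p)) {}

    std::shared_ptr<T> load() const
    {
        lock();
        std::shared_ptr<T> copy = ptr_; // refcount bump happens under the lock
        unlock();
        return copy;
    }

    void store(std::shared_ptr<T> desired)
    {
        lock();
        ptr_.swap(desired);             // swap under the lock...
        unlock();
    }                                   // ...old value released here, outside it

private:
    void lock() const
    {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin
        }
    }
    void unlock() const { flag_.clear(std::memory_order_release); }

    std::shared_ptr<T> ptr_;
    mutable std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};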
After studying some of these implementations, one thing is obvious: implementing a lock-free std::shared_ptr is very difficult, and it seems to require so many compare_and_exchange operations as to make me question whether a simple spin lock would achieve better performance.
The main reason it's so difficult to implement a lock-free reference-counted pointer is the race that always exists between reading the shared control block (or the shared object itself, if we're talking about an intrusive shared pointer) and modifying the reference count.
In other words, you can't even safely read the reference count because you never know when some other thread has deallocated the memory where the reference count lives.
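To make the race concrete, here is a sketch of the naive approach and how it goes wrong (the control_block type and function names are my own illustration, not any particular library's internals):

#include <atomic>

// Hypothetical naive control block, for illustration only.
struct control_block {
    std::atomic<int> refcount{1};
    // ... pointer to the managed object, deleter, etc. ...
};

control_block* shared_ctrl = new control_block; // seen by both threads

// Thread A: tries to claim a reference.
void acquire()
{
    control_block* cb = shared_ctrl; // (1) read the control-block pointer
    // <-- thread B may run release() to completion right here
    cb->refcount.fetch_add(1);       // (2) increments memory that may
                                     //     already have been freed
}

// Thread B: drops the last reference.
void release()
{
    control_block* cb = shared_ctrl;
    if (cb->refcount.fetch_sub(1) == 1)
        delete cb;                   // the control block is gone
}

// Reading the pointer and incrementing the count it points to are two
// separate steps, so the block can vanish between them; this is the
// race a lock-free shared_ptr has to close.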
So, in general, various complex strategies are employed to create lock-free versions. The implementation here looks like it uses a double reference count strategy, where "local" references count the number of threads concurrently accessing the shared_ptr instance, and "shared" or "global" references count the number of shared_ptr instances pointing to the shared object.
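Roughly, the split-count idea looks something like this (names and layout are my own illustration of the general scheme, not code from any specific implementation):

#include <atomic>

// The shared state: one "global" count of shared_ptr instances that
// own the object.
struct control_block {
    std::atomic<long> shared_count{1};
    // ... managed object pointer, deleter, etc. ...
};

// What the atomic pointer actually stores: the control-block address
// plus a "local" count of threads currently inside an operation on
// this particular pointer. Both travel together in one exchangeable
// unit (a double-width CAS on many 64-bit targets), so a reader can
// bump local_count and learn the block address in a single step, which
// keeps the block alive for it even if the pointer is replaced
// underneath it.
struct counted_ptr {
    control_block* block;
    long           local_count;
};

struct atomic_shared_ptr_ish {
    std::atomic<counted_ptr> value;
    // Readers CAS local_count upward before touching *block; when an
    // operation finishes, the helpers it accumulated are folded back
    // into shared_count, so the block is only freed once both counts
    // agree that nobody is using it.
};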
Given all this complexity, I was really surprised to find a Dr. Dobbs article, from 2004 no less (way before C++11 atomics), that seems to nonchalantly solve this entire problem:
http://www.drdobbs.com/atomic-reference-counting-pointers/184401888
It looks like the author is claiming to somehow be able to:
"... [read] the pointer to the counter, increments the counter, and returns the pointer—all in such a manner that no other threads can cause an incorrect result"
But I don't really understand the way he actually implements this. He's using (non-portable) PowerPC instructions (the LL/SC primitives lwarx and stwcx) to pull this off.
The relevant code that does this is what he calls "aIandF" (atomic increment and fetch), which he defines as:

addr aIandF(addr r1)
{
    addr tmp;
    int c;
    do {
        do {
            tmp = *r1;
            if (!tmp) break;
            c = lwarx(tmp);
        } while (tmp != *r1);
    } while (tmp && !stwcx(tmp, c+1));
    return tmp;
};
Apparently, addr is a pointer type pointing to the shared object that owns the reference count variable.
My questions are: first, is this only possible on an architecture that supports LL/SC operations? It seems it would be impossible to do with cmpxchg. And secondly, how exactly does this work? I've read over this code a few times now, and I can't really understand what's going on. I understand what LL/SC primitives do; I just can't make any sense of the code.
The best I can understand is that addr r1 is the address of the pointer to the shared object, and also the address of the pointer to the reference count (which I guess means the reference count variable needs to be the first member of the struct that defines the shared object). He then dereferences r1, getting the actual address of the shared object. Then he load-links the value stored at that address and puts the result in c; this is the counter value. He then conditionally stores that value, incremented, back through tmp (which will fail if the reservation on the memory at tmp has been lost).
What I don't understand is how this works. The address of the shared object may never change and the LL/SC could succeed - but how does this help us if another thread has deallocated the shared object in the mean time?
Reference counting is one technique for tracking object lifetime: simply keep an extra counter along with each object that is created. The counter is the number of references that exist to the object; in the C/C++ case, this would be how many pointers refer to the object.
std::shared_ptr: all operations that do not change the reference count are as lock-free as the corresponding operations on a raw pointer.
The smart pointer allocates a small block of memory to contain the reference counter. Each copy of the smart pointer then receives a pointer to the actual object and a pointer to the reference count.
The shared reference counter counts the number of owners. Copying a std::shared_ptr increases the reference count by one. Destroying a std::shared_ptr decreases the reference count by one. If the reference count becomes zero, the resource will automatically be released.
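A tiny example of that behaviour:

#include <iostream>
#include <memory>

int main()
{
    std::shared_ptr<int> a = std::make_shared<int>(5);
    std::cout << a.use_count() << '\n';     // 1: a is the only owner

    {
        std::shared_ptr<int> b = a;         // copy: count goes to 2
        std::cout << a.use_count() << '\n'; // 2
    }                                       // b destroyed: back to 1

    std::cout << a.use_count() << '\n';     // 1
}                                           // a destroyed: count hits 0,
                                            // the int is released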
Here is aIandF again, with comments added:

addr aIandF(addr r1)
{
    addr tmp;
    int c;
    do {
        do {
            // r1 holds the address of the address of the refcount
            tmp = *r1;        // grab the address of the refcount
            if (!tmp)
                break;        // if it's null, bail

            // read current refcount and acquire reservation
            c = lwarx(tmp);

            // now we hold the reservation, check to see if another
            // thread has changed the shared block address
        } while (tmp != *r1); // if so, start over

        // if the store succeeds we know we held the reservation throughout
    } while (tmp && !stwcx(tmp, c+1));
    return tmp;
};
Note that aIandF is used specifically when constructing a copy of an existing shared pointer, claiming a reference for the copy.
The Dr Dobbs article describes the operation for releasing a reference as first atomically swapping the address of the shared counter in the source shared pointer object with a null pointer local to the function; then atomically decrementing the counter; then testing to see if the result of the decrement was zero. This order of operations is important: you say, "The address of the shared object may never change and the LL/SC could succeed - but how does this help us if another thread has deallocated the shared object in the mean time?" - but this can never happen, since the object will never be deallocated without the swap happening first, giving us a means to observe the change of address.
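In outline, the release path the article describes could be expressed like this (a sketch in terms of C++ atomics with my own names; the original, of course, uses the PowerPC primitives directly):

#include <atomic>

// The shared object carries its own reference count (intrusive style,
// as in the article).
struct refcounted {
    std::atomic<int> refcount;
    // ... the shared data ...
};

// "slot" is the counter-address field inside the source shared
// pointer object.
void release(std::atomic<refcounted*>& slot)
{
    // Step 1: atomically swap the counter address with null. After
    // this, no new reference can be claimed through this slot, and a
    // concurrent aIandF-style reader of the slot will observe the
    // change of address.
    refcounted* obj = slot.exchange(nullptr);
    if (!obj)
        return;                          // already released elsewhere

    // Step 2: atomically decrement the counter we detached.
    // Step 3: if that was the last reference, destroy the object.
    if (obj->refcount.fetch_sub(1) == 1)
        delete obj;
}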
aIandF tests for the counter address being null on entry. It can spot the address becoming null if that occurs before the lwarx, because it explicitly tests for this once it has the reservation.
If the swap in the decrementing thread occurs after the lwarx, we do not actually care: if the stwcx in aIandF succeeds, we know the decrementing thread will see the new reference count and not destroy the object, and we can proceed knowing we have claimed a reference to it; whereas if the other thread succeeds in decrementing the counter first, we will lose our reservation, the store will fail, and we'll detect the destruction of the object on the next loop iteration.
This algorithm assumes a strongly consistent memory model (all threads always see the effects of each other's reads and writes in program order) - this is not necessarily the case even on those modern architectures that do support LL/SC.
EDIT: thinking about it, the algorithm also apparently makes the assumption that it is always safe to read from a memory address that was once valid (e.g., no MMU/protection; or, the algorithm is broken):
if (!tmp)
    break;
// another thread could, at this point, do its swap, decrement
// *and* destroy the object tmp points to before we get to do
// anything else
c = lwarx(tmp);
// if that happened, we'll detect this fact and do nothing with c
// but ONLY if the lwarx doesn't trap (due to the memory tmp points
// to getting unmapped when the other thread frees the object)