A few days ago I happened to watch this very interesting presentation by Stephan T. Lavavej, which mentions the "We Know Where You Live" optimization (sorry for using the acronym in the question title; SO warned me the question might have been closed otherwise), and this beautiful one by Herb Sutter on machine architecture.
Briefly, the "We Know Where You Live" optimization consists in placing the reference counters in the same memory block as the object that make_shared is creating, resulting in a single memory allocation rather than two and making shared_ptr more compact.
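For reference, here is a minimal sketch (with a made-up Widget type) of the two ways of creating a shared_ptr and where the reference counters end up; the comments describe what typically happens, since the exact layout is up to the implementation:

    #include <memory>

    struct Widget {          // hypothetical payload type
        int data[16];
    };

    int main() {
        // Two allocations: one for the Widget, one for the control block
        // (strong/weak counters, deleter). The counters live in a separate
        // memory block, possibly far away from the object.
        std::shared_ptr<Widget> a(new Widget());

        // One allocation: control block and Widget are placed in a single
        // memory block, so the counters sit right next to the object
        // (which is what enables the WKWYL optimization).
        auto b = std::make_shared<Widget>();
    }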
After summing up what I learnt from the two presentations above, however, I started to wonder whether the WKWYL optimization might actually degrade performance when a shared_ptr is accessed by multiple threads running on different cores.
If the reference counters are close to the actual object in memory, they are more likely to be fetched into the same cache line as the object itself. If I got the lesson correctly, this in turn makes it more likely that threads will slow down while competing for the same cache line even when they do not need to.
Suppose one of the threads needs to update the reference counter several times (e.g. when copying the shared_ptr around), while the others just need to access the pointed-to object: isn't this going to slow down all threads by making them compete for the same cache line?
If the refcount lived somewhere else in memory, I would say contention would be less likely to arise.
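For concreteness, the access pattern I have in mind might look something like this sketch (the Widget type and the loop counts are made up): one thread keeps copying the shared_ptr, so it constantly updates the reference count, while the other threads only read the object that may share a cache line with that count:

    #include <atomic>
    #include <memory>
    #include <thread>
    #include <vector>

    struct Widget {
        int payload = 42;
    };

    int main() {
        auto sp = std::make_shared<Widget>();    // counters + Widget in one block

        // Writer: copies (and destroys) the shared_ptr in a loop, i.e. it keeps
        // doing atomic increments/decrements on the reference count.
        std::thread copier([sp] {
            for (int i = 0; i < 1'000'000; ++i) {
                std::shared_ptr<Widget> copy = sp;   // ++refcount, then --refcount
            }
        });

        // Readers: never touch the reference count, they only read the object,
        // which may sit on the same cache line as the counters.
        std::atomic<long> sink{0};
        std::vector<std::thread> readers;
        for (int t = 0; t < 3; ++t) {
            readers.emplace_back([&sink, &sp] {
                long local = 0;
                for (int i = 0; i < 1'000'000; ++i)
                    local += sp->payload;            // read-only access
                sink += local;
            });
        }

        copier.join();
        for (auto& r : readers) r.join();
    }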
Does this make a good argument against using make_shared() in similar cases (as long as it implements the WKWYL optimization, of course)? Or is there a fallacy in my reasoning?
If that's your usage pattern then sure, make_shared will result in "false sharing", which is the name I know for different threads using the same cache line even though they aren't accessing the same bytes.
The same is true for any object of which nearby parts are used by different threads (one of which is writing). In this case the "object" is the combined block created by make_shared. You could as well ask whether any attempt to benefit from data locality can backfire in cases where proximal data is used in different threads more-or-less simultaneously. Yes, it can.
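The effect is easy to see without shared_ptr at all; in this sketch (names made up) two threads update logically independent counters that are adjacent in memory and therefore almost certainly land on the same cache line:

    #include <atomic>
    #include <thread>

    // Two counters that are logically independent but adjacent in memory:
    // they will normally share one cache line, so two threads updating them
    // "independently" still bounce that line between cores.
    struct Counters {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    int main() {
        Counters c;
        std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join();
        t2.join();
    }

Timing this against a variant where the counters are padded onto separate cache lines (like the sketch a little further down) typically shows a noticeable difference, though the exact numbers depend on the hardware.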
One can conclude that contention is less likely to arise if every writable part of every object is allocated in distant locations. So, usually the fix for false sharing is to spread things out (in this case, you could stop using make_shared, or you could put padding into the object to separate its parts into different cache lines).
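If padding is the route you take, one sketch (using the C++17 constant std::hardware_destructive_interference_size where the library provides it, and assuming a 64-byte line otherwise) is to align the frequently-written parts onto their own cache lines:

    #include <atomic>
    #include <cstddef>
    #include <new>

    // Use the C++17 constant when available, otherwise guess a common line size.
    #ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t kLine = std::hardware_destructive_interference_size;
    #else
    constexpr std::size_t kLine = 64;   // assumption: typical cache-line size
    #endif

    // Each frequently-written member gets its own cache line, so a writer no
    // longer invalidates the line that other threads are reading from.
    struct PaddedCounters {
        alignas(kLine) std::atomic<long> a{0};
        alignas(kLine) std::atomic<long> b{0};
    };

    static_assert(sizeof(PaddedCounters) >= 2 * kLine, "counters on separate lines");

The same idea applies to the make_shared case: padding inside the object can keep its read-mostly members off the cache line holding the reference counters, although where the control block sits relative to the object is an implementation detail.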
As against that, when the different parts are used in the same thread, spreading them through memory has a cost of its own, because there's more to fetch into cache. Since spreading things out has its own costs, it might not actually help as many apps as you'd first think. But no doubt it's possible to write code for which it does help.
Sometimes the benefit of make_shared has nothing to do with cache lines and locality: it simply makes one dynamic allocation instead of two. The value of that depends on how many objects you allocate and free: it might be negligible; it might be the difference between your app fitting in RAM and swapping like crazy; in some cases it could even determine whether your app is able to make all the allocations it needs at all.
FYI, there's another situation where you might not want to use make_shared, and that's when the object isn't small and you have weak pointers that significantly outlive the shared_ptr. The reason is that the control block isn't freed until the weak pointers are gone, and hence with make_shared the whole memory block occupied by the object isn't freed until the weak pointers are gone either. The object is destructed as soon as the shared pointers are, of course, so it's just the size of the class that matters, not associated resources.
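A sketch of that situation (BigObject and its size are made up):

    #include <memory>

    struct BigObject {
        char buffer[1024 * 1024];   // large footprint, no external resources
    };

    int main() {
        std::weak_ptr<BigObject> wp;
        {
            auto sp = std::make_shared<BigObject>(); // one block: object + control block
            wp = sp;
        }   // BigObject is destroyed here...
            // ...but because the object and the control block share one
            // allocation, the whole ~1 MB block cannot be returned to the
            // allocator until wp is gone. With shared_ptr<BigObject>(new
            // BigObject()) the object's memory would be freed at the closing
            // brace and only the small control block would outlive it.
        return wp.expired() ? 0 : 1;    // expired, yet the memory is still held
    }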
Note that allocating the ref count next to the object isn't directly the WKWYL optimization -- it's the primary intended effect of std::make_shared itself. You have full control: use make_shared<T>() to save an allocation and put the reference count with the object, or use shared_ptr<T>( new T() ) to keep it separate.
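If you want to see the difference, one way (a rough sketch; the counting operator new below is only for demonstration, is not thread-safe and ignores the alignment overloads) is to count heap allocations for each form:

    #include <cstdio>
    #include <cstdlib>
    #include <memory>
    #include <new>

    static std::size_t g_allocs = 0;

    // Count every dynamic allocation going through the global operator new.
    void* operator new(std::size_t size) {
        ++g_allocs;
        if (void* p = std::malloc(size)) return p;
        throw std::bad_alloc();
    }
    void operator delete(void* p) noexcept { std::free(p); }
    void operator delete(void* p, std::size_t) noexcept { std::free(p); }

    struct Widget { int x = 0; };

    int main() {
        std::size_t before = g_allocs;
        { std::shared_ptr<Widget> a(new Widget()); }
        std::printf("shared_ptr(new Widget()): %zu allocations\n", g_allocs - before); // typically 2

        before = g_allocs;
        { auto b = std::make_shared<Widget>(); }
        std::printf("make_shared<Widget>():    %zu allocations\n", g_allocs - before); // typically 1
    }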
Yes, if you place the object and the reference count on the same cache line, it might lead to performance degradation due to false sharing when the reference count is updated frequently while the object is only read.
However, the way I see it, there are reasons why this doesn't weigh against doing the optimization. For one, depending on how shared_ptr is implemented on your platform, you could emulate the padding yourself, either by inserting padding into the object or (possibly, depending on the order in memory) by giving it a deleter which includes enough padding.

Regarding "Suppose one of the threads needs to update the reference counter several times (e.g. when copying the shared_ptr around), while the others just need to access the pointed-to object: isn't this going to slow down all threads by making them compete for the same cache line?":
Yes, but is that a realistic scenario?
In my code the threads that copy the shared_ptr do so because they want to share ownership of the object so they can use it. If the threads making all those reference-count updates don't care about the object, why are they bothering to share in ownership of it?
You can mitigate the problem by passing around const shared_ptr& references and only making (or destroying) a copy when you actually want to own and access the object, e.g. when transferring it across thread or module boundaries or when taking ownership of the object to use it.
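A sketch of that convention (function and class names are made up): take const shared_ptr& where you only need to use the object, and make the copy only at the point where ownership is actually retained:

    #include <memory>
    #include <utility>
    #include <vector>

    struct Widget { int payload = 0; };

    // Only uses the object: no copy, no reference-count traffic.
    int inspect(const std::shared_ptr<Widget>& w) {
        return w ? w->payload : 0;
    }

    // Actually keeps the object alive beyond the call: this is the one place
    // where a copy (one refcount increment) is deliberately made.
    class Consumer {
    public:
        void adopt(std::shared_ptr<Widget> w) {   // pass by value...
            owned_.push_back(std::move(w));       // ...and move into storage
        }
    private:
        std::vector<std::shared_ptr<Widget>> owned_;
    };

    int main() {
        auto w = std::make_shared<Widget>();
        int v = inspect(w);     // no refcount update
        Consumer c;
        c.adopt(w);             // exactly one refcount increment
        return v;
    }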
In general, intrusive reference counts outperform external reference counts (see Smart Pointer Timings) precisely because they're on a single cache line and so you don't need to use up two precious cache lines for the object and its refcount. Remember that if you've used up an extra cache line that's one less cache line for everything else, and something will get evicted and you'll get a cache miss when that is next needed.
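For comparison, here is a minimal intrusive-refcount sketch (simplified: no weak pointers, no custom deleters, made-up names); the count is embedded in the object, so the object and its count always share one allocation and, for small objects, usually one cache line:

    #include <atomic>
    #include <utility>

    // Base class holding the embedded (intrusive) reference count.
    class RefCounted {
    public:
        void add_ref() const { count_.fetch_add(1, std::memory_order_relaxed); }
        void release() const {
            if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) delete this;
        }
    protected:
        virtual ~RefCounted() = default;
    private:
        mutable std::atomic<int> count_{1};   // starts owned by the creator
    };

    // Minimal smart pointer over any RefCounted type (no weak_ptr support).
    template <class T>
    class IntrusivePtr {
    public:
        explicit IntrusivePtr(T* p = nullptr) : p_(p) {}   // adopts the initial count
        IntrusivePtr(const IntrusivePtr& o) : p_(o.p_) { if (p_) p_->add_ref(); }
        IntrusivePtr(IntrusivePtr&& o) noexcept : p_(std::exchange(o.p_, nullptr)) {}
        ~IntrusivePtr() { if (p_) p_->release(); }
        IntrusivePtr& operator=(IntrusivePtr o) noexcept { std::swap(p_, o.p_); return *this; }
        T* operator->() const { return p_; }
        T& operator*() const { return *p_; }
    private:
        T* p_;
    };

    struct Widget : RefCounted { int payload = 42; };

    int main() {
        IntrusivePtr<Widget> a(new Widget());   // count == 1
        IntrusivePtr<Widget> b = a;             // count == 2, no separate control block
        return b->payload == 42 ? 0 : 1;
    }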