
Does the WKWYL optimization in make_shared<>() introduce a penalty for some multithreading applications?

A few days ago I happened to watch this very interesting presentation by Stephan T. Lavavej, which mentions the "We Know Where You Live" optimization (sorry for using the acronym in the question title, SO warned me the question might have been closed otherwise), and this beautiful one by Herb Sutter on machine architecture.

Briefly, the "We Know Where You Live" optimization consists of placing the reference counters in the same memory block as the object that make_shared creates, resulting in a single memory allocation rather than two and making shared_ptr more compact.

After summing up what I learnt from the two presentations above, however, I started to wonder whether the WKWYL optimization could degrade performance when a shared_ptr is accessed by multiple threads running on different cores.

If the reference counters are close to the actual object in memory, in fact, they should be more likely to be fetched into the same cache line as the object itself. This in turn, if I got the lesson correctly, would make it more likely that threads will slow down while competing for the same cache line even when they do not need to.

Suppose one of the threads needs to update the reference counter several times (e.g. when copying the shared_ptr around), while the other ones just need to access the pointed object: isn't this going to slow down the execution of all threads by making them compete for the same cache line?

If the refcount lived somewhere else in memory, I would say contention would be less likely to arise.

Does this make a good argument against using make_shared() in similar cases (as long as it implements the WKWYL optimization, of course)? Or is there a fallacy in my reasoning?

Andy Prowl, asked Jan 15 '13

3 Answers

If that's your usage pattern then sure, make_shared will result in "false sharing", which is the name I know for different threads using the same cache line even though they aren't accessing the same bytes.

The same is true for any object of which nearby parts are used by different threads (one of which is writing). In this case the "object" is the combined block created by make_shared. You could as well ask whether any attempt to benefit from data locality can backfire in cases where proximal data is used in different threads more-or-less simultaneously. Yes, it can.

One can conclude that contention is less likely to arise if every writable part of every object is allocated in distant locations. So, usually the fix for false sharing is to spread things out (in this case, you could stop using make_shared or you could put padding into the object to separate its parts into different cache lines).

On the other hand, when the different parts are used in the same thread, spreading them through memory has a cost, because there's more to fetch into cache. Since spreading things out has its own costs, it might not actually help as many apps as you'd first think. But no doubt it's possible to write code for which it does help.

Sometimes the benefit of make_shared has nothing to do with cache lines and locality; it's simply that it makes one dynamic allocation instead of two. The value of that depends on how many objects you allocate and free: it might be negligible; it might be the difference between your app fitting in RAM vs. swapping like crazy; in certain cases halving the allocation count could even be what allows your app to make all the allocations it needs at all.

FYI, there's another situation where you might not want to use make_shared, and that's when the object isn't small and you have weak pointers that significantly outlive the shared_ptrs. The reason is that the control block isn't freed until the weak pointers are gone, and hence if you used make_shared then the whole memory occupied by the object isn't freed until the weak pointers are gone. The object will be destructed as soon as the shared pointers are, of course, so it's just the size of the class that matters, not associated resources.

Steve Jessop, answered Oct 17 '22

Note that allocating the reference count next to the object isn't really a separate WKWYL optimization -- it's the primary intended effect of std::make_shared itself. You have full control: use make_shared<T>() to save an allocation and put the reference count with the object, or use shared_ptr<T>( new T() ) to keep it separate.

Yes, if you place the object and the reference count in the same cache line, it can lead to performance degradation due to false sharing when the reference count is updated frequently while the object is only read.

However, the way I see it, there are a few reasons why this concern doesn't weigh against doing the optimization:

  1. In general you don't want the reference count to change frequently, since that by itself is a performance problem (atomic operations, several threads accessing it, ...) which you want to avoid (and probably can for most cases)
  2. Doing this optimization doesn't necessarily incur the potential extra performance problems you described. For those to occur, the reference count and (parts of) the object need to be in the same cache line. That could easily be avoided by adding appropriate padding between the reference count (plus other control data) and the object; the optimization would still do one allocation instead of two and therefore still be beneficial. However, for the more likely case that doesn't trigger false sharing, the padded version would be slower than the non-padded one, since the non-padded version benefits from better locality (the object and reference count share a cache line). For this reason I think this variant is a possible optimization for highly threaded code, but not necessarily one to make in the standard version.
  3. If you know how shared_ptr is implemented on your platform, you can emulate the padding yourself, either by inserting padding into the object or (possibly, depending on the memory layout) by giving it a deleter that includes enough padding.
Grizzly, answered Oct 17 '22


Suppose one of the threads needs to update the reference counter several times (e.g. when copying the shared_ptr around), while the other ones just need to access the pointed object: isn't this going to slow down the execution of all threads by making them compete for the same cache line?

Yes, but is that a realistic scenario?

In my code the threads that copy the shared_ptr do so because they want to share ownership of the object so they can use it. If the threads making all those reference-count updates don't care about the object, why are they bothering to share in ownership of it?

You can mitigate the problem by passing around const shared_ptr& references and only making (or destroying) a copy when you actually want to own and access the object, e.g. when transferring it across thread or module boundaries or when taking ownership of the object to use it.

In general, intrusive reference counts outperform external reference counts (see Smart Pointer Timings) precisely because they're on a single cache line and so you don't need to use up two precious cache lines for the object and its refcount. Remember that if you've used up an extra cache line that's one less cache line for everything else, and something will get evicted and you'll get a cache miss when that is next needed.

Jonathan Wakely, answered Oct 17 '22