I understand a good compiler can perform optimization so that accessing, say, an int* via a std::shared_ptr is done using the same assembly as if a raw int* were used.
My question is: would the cache line containing the optimized smart pointer be polluted with other data members of the smart pointer, such as the reference counter? So that, although the generated assembly would be identical to that for a raw pointer, cache performance could be worse because less of the cache line is being used efficiently?
EDIT: This performance effect could be more noticeable if we were iterating through a structure like std::vector<std::shared_ptr<int>> and using the ints.
Use unique_ptr when you want a single pointer to an object that will be reclaimed when that single pointer is destroyed. Use shared_ptr when you want multiple pointers to the same resource.
An object referenced by the contained raw pointer will not be destroyed as long as the reference count is greater than zero, i.e., until all copies of the shared_ptr have been destroyed. So, we should use shared_ptr when we want multiple owners referring to the same managed object.
By moving the shared_ptr instead of copying it, we "steal" the reference count and nullify the other shared_ptr. "Stealing" the reference count requires no atomic operation, so a move is much faster than copying the shared_ptr (which causes an atomic reference-count increment).
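A minimal sketch of the difference, assuming a typical reference-counted implementation:
#include <memory>
#include <utility>

int main() {
    std::shared_ptr<int> sp = std::make_shared<int>(42);
    std::shared_ptr<int> c = sp;             // copy: atomic increment of the reference count
    std::shared_ptr<int> m = std::move(sp);  // move: the pointers are stolen, sp becomes null, no atomic op
}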
Smart pointers are class objects that behave like raw pointers but manage objects created with new, deciding when or whether to delete them: a smart pointer automatically deletes the managed object at the appropriate time.
There are several things to consider, but in summary one can say: It doesn't really matter.
First, there is no guarantee (or rather, no requirement) that there is a reference counter at all. The standard merely specifies how std::shared_ptr has to behave. Using a reference counter is one way of achieving this; a circular linked list is another.
Practically, all implementations (all that I know of, at least) do use a reference counter.
Second, the reference counter can be allocated separately via operator new, or in the same location, using the same allocation as the managed object (which is then created via placement new), if you use make_shared. The latter is, again, not strictly guaranteed. The standard states "Implementations are encouraged, but not required, to perform no more than one memory allocation", explicitly allowing something different.
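To make that concrete, here is a minimal sketch of the two allocation strategies (whether make_shared actually coalesces the two allocations is up to the implementation, as the quote above says):
#include <memory>

int main() {
    // Typically two allocations: one for the int, one for the control block.
    std::shared_ptr<int> a(new int(1));
    // Typically a single allocation holding both the int and the control block.
    auto b = std::make_shared<int>(2);
}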
If the reference counter is separately allocated, it will most likely live in a different cache line, and so the system will consume two cache lines rather than one when accessing the object. In some situations, however, that is an advantage, not a disadvantage (namely, whenever you copy the smart pointer; see below).
In addition to the data cache, there is also the TLB, which is much smaller (usually fewer than 64 entries). Given that there is a certain likelihood that two separately allocated objects will land not only in different cache lines but in different memory pages, this may be another thing that costs a few extra cycles.
If the reference counter is allocated in the same location, it will very likely be in the same cache line as the beginning of the object (but it might be in the previous cache line as well). This looks like an advantage, but it is not necessarily one.
Whenever the shared_ptr is copied, or one smart pointer referencing the same object goes out of scope, the reference counter must be modified. This is a write to the cache line the counter lives in (or rather, since it is an atomic operation, not a plain write, but the net effect to the outside world is the same). The cache line is invalidated and must be fetched again by everybody else wanting to access it.
There is a well-known problem called "false sharing", where naive parallel processing runs much slower than anyone would expect, and it occurs for exactly this reason.
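A minimal sketch of the effect (the 64-byte cache line size is an assumption about typical hardware):
#include <atomic>

// Two counters packed into the same cache line: threads incrementing a and b
// concurrently invalidate each other's copy of the line on every write.
struct Crowded {
    std::atomic<int> a;
    std::atomic<int> b;
};

// Padding each counter to its own (assumed 64-byte) cache line avoids the interference.
struct Padded {
    alignas(64) std::atomic<int> a;
    alignas(64) std::atomic<int> b;
};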
Now if the reference counter is allocated together with the object, this means that whenever the shared pointer is copied (or goes out of scope) the next access to the object is a guaranteed cache miss (since the cache line containing the start of the object will be purged from cache).
Yes, there are cache effects, but you should not worry too much. Cache misses happen all the time anyway (the same goes for the TLB). As long as you have at least a somewhat coherent access pattern, this will not matter at all.
After the next context switch (that is, every few milliseconds, or after the next interrupt or syscall) your cache and TLB will likely be gone anyway. That's something everyone has to live with, and it's not a problem at all.
You do not use a shared_ptr for fun, but because it provides valuable functionality that you need. Copying the pointer might be 3-4 cycles slower than copying a raw pointer, and it might cause an occasional extra cache miss, but you don't make a hundred thousand copies per second.
The safety and overall utility benefits far outweigh the disadvantages.
I have good reasons to believe that you are worrying about the wrong thing.
Consider the following two snippets:
#include <cstdio>
#include <memory>
#include <vector>
int main() {
    std::vector<std::shared_ptr<int>> v = { std::make_shared<int>(1),
                                            std::make_shared<int>(2),
                                            std::make_shared<int>(3) };
    for (auto& e : v)
        std::printf("%d\n", *e);
}
and the same functionality but with unique_ptr:
#include <cstdio>
#include <memory>
#include <vector>
// C++11 has no std::make_unique (it was added in C++14), hence this local helper.
template<typename T, typename... Args>
std::unique_ptr<T> make_unique(Args&&... args) {
    return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}
int main() {
    std::vector<std::unique_ptr<int>> v;
    v.reserve(3);
    v.emplace_back(make_unique<int>(1));
    v.emplace_back(make_unique<int>(2));
    v.emplace_back(make_unique<int>(3));
    for (auto& e : v)
        std::printf("%d\n", *e);
}
Just to answer your question:
This performance effect could be more noticeable if we were iterating through a structure like std::vector<std::shared_ptr<int>> and using the ints.
Here are the corresponding loops:
shared_ptr version | unique_ptr version
.L66: | .L70:
movq (%rbx), %rax | movq (%rbx), %rax
movl $.LC0, %esi | movl $.LC0, %esi
movl $1, %edi | movl $1, %edi
movl (%rax), %edx | movl (%rax), %edx
xorl %eax, %eax | xorl %eax, %eax
.LEHB2: | .LEHB6:
call __printf_chk | call __printf_chk
.LEHE2: | .LEHE6:
addq $16, %rbx | addq $8, %rbx
cmpq %rbx, %rbp | cmpq %rbx, %rbp
jne .L66 | jne .L70
The only relevant difference is that you are doing larger steps with shared_ptr when reading the memory: in both cases the access pattern is linear with a fixed stride, but the stride is 16 bytes instead of 8 (see the addq increments). I failed to come up with a scenario where there is any measurable difference between the two.
In my opinion, that answers your question.
Some unsolicited advice, just to show you that you are worrying about the wrong thing: analyze the generated assembly code for the two code snippets above (g++ -std=c++11 -Wall -Wextra -pedantic -fwhole-program -O3 -S, then run the assembly through c++filt). You will find that shared_ptr is a heavyweight object compared to unique_ptr. You will see that it is very expensive to construct and destroy shared_ptrs and to maintain the atomic reference counts; you pay for the heavyweight multi-threading machinery as well.
I know it is a questionable metric, but the assembly code for shared_ptr is 740 lines, and the code for unique_ptr is 402 lines. If you analyze the construction and destruction of the objects, you will notice that significantly more code is executed for shared_ptr, and many of those instructions are more expensive (expensive multi-threading stuff). shared_ptr also consumes more memory for the extra bookkeeping (in these silly code snippets: 144 vs. 36 bytes, that is, 4 times more with shared_ptr).
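If you want a quick feel for the per-pointer overhead on your own platform, here is a minimal sketch (the printed sizes are implementation-specific; on a typical 64-bit implementation expect 16 vs. 8 bytes, since shared_ptr stores an extra pointer to its control block):
#include <cstdio>
#include <memory>

int main() {
    // shared_ptr typically holds two pointers (object + control block), unique_ptr just one.
    std::printf("%zu %zu\n", sizeof(std::shared_ptr<int>),
                             sizeof(std::unique_ptr<int>));
}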
Potential cache misses would be the last thing that I would worry about at this point.