Does accessing via a shared_ptr pollute the cache line more than raw pointer?

I understand a good compiler can perform optimization so that accessing, say an int* via a std::shared_ptr is done using the same assembly as if a raw int* pointer was used.

My question is: would the cache line containing the optimized smart pointer be polluted with other data members from the smart pointer, like the reference counter? So that although the assembly generated would be identical with a real pointer, the cache performance could be worse because not as much of the cache line is being used efficiently?

EDIT: This performance effect could be more noticeable if we were iterating through a structure like std::vector&lt;std::shared_ptr&lt;int&gt;&gt; and using the ints.

asked Apr 16 '14 by user997112


2 Answers

There are several things to consider, but in summary one can say: It doesn't really matter.

First, there is no guarantee (or rather, no requirement) that there is a reference counter at all; the standard only specifies how std::shared_ptr must behave. A reference counter is one way of achieving that behavior, a circularly linked list is another.
In practice, however, every implementation I know of does use a reference counter.

Second, the reference counter can be allocated separately with its own call to operator new, or, if you use make_shared, in the same allocation as the managed object (which is then created via placement new). The latter is, again, not strictly guaranteed: the standard states "Implementations are encouraged, but not required, to perform no more than one memory allocation", explicitly allowing something different.

If the reference counter is separately allocated, then it will most likely live in a different cache line, and so the system will consume two cache lines rather than one when accessing the object. That, however, is in some situations an advantage, not a disadvantage (that is, whenever you copy the smart pointer, see below).
In addition to the data cache, there exists also the TLB, which is much smaller (usually less than 64 entries). Given that there is a certain likelihood that two separately allocated objects will not only be in different cache lines but in different memory pages, this may be another thing that will cost a few extra cycles.

If the reference counter is allocated in the same location, it will very likely be in the same cache line as the beginning of the object (but it might be in the previous cache line as well). This looks like an advantage, but it is not necessarily one.

Whenever a shared_ptr is copied, or a shared_ptr referencing the same object goes out of scope, the reference counter must be modified. This is a write to the cache line the counter lives in (strictly, an atomic read-modify-write rather than a plain write, but the effect on the cache is the same). The cache line is invalidated and must be fetched again by every other core that wants to access it.
This is the same mechanism behind the well-known "false sharing" problem, where naive parallel code runs far slower than expected because unrelated data written by different threads happens to share a cache line.

Now if the reference counter is allocated together with the object, this means that whenever the shared pointer is copied (or goes out of scope) the next access to the object is a guaranteed cache miss (since the cache line containing the start of the object will be purged from cache).

tl;dr

Yes, there are cache effects, but you should not worry too much. Cache misses happen all the time regardless (and the same goes for the TLB). As long as you have a somewhat coherent access pattern, this will not matter at all.
After the next context switch (that is, every few milliseconds, or after the next interrupt or syscall) your cache and TLB will likely be gone anyway. That's something everyone has to live with, and it's not a problem at all.

You do not use a shared_ptr for fun, but because it provides a valuable functionality that you need. Copying the pointer might be 3-4 cycles slower than copying a raw pointer, and it might cause an occasional extra cache miss when you copy it, but you don't make a hundred thousand copies per second.

The safety and overall utility benefits far outweigh the disadvantages.

answered Oct 11 '22 by Damon

I have good reasons to believe that you are worrying about the wrong thing.

Consider the following two snippets:

#include <cstdio>
#include <memory>
#include <vector>

int main() {
  std::vector<std::shared_ptr<int>> v = { std::make_shared<int>(1), 
                                          std::make_shared<int>(2),
                                          std::make_shared<int>(3) };
  for (auto& e : v)
    std::printf("%d\n", *e);
}

and the same functionality but with unique_ptr:

#include <cstdio>
#include <memory>
#include <vector>

// std::make_unique is standard only since C++14; this is the usual C++11 stand-in.
template<typename T, typename... Args>
std::unique_ptr<T> make_unique(Args&&... args) {
    return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}

int main() {
  std::vector<std::unique_ptr<int>> v;
  v.reserve(3);
  v.emplace_back(make_unique<int>(1));
  v.emplace_back(make_unique<int>(2));
  v.emplace_back(make_unique<int>(3));
  for (auto& e : v)
    std::printf("%d\n", *e);
}

Just to answer your question:

This performance effect could be more noticeable if we were iterating through a structure like std::vector&lt;std::shared_ptr&lt;int&gt;&gt; and using the ints.

Here are the corresponding loops:

    shared_ptr version   |    unique_ptr version
.L66:                    | .L70:
    movq    (%rbx), %rax |    movq    (%rbx), %rax
    movl    $.LC0, %esi  |    movl    $.LC0, %esi
    movl    $1, %edi     |    movl    $1, %edi
    movl    (%rax), %edx |    movl    (%rax), %edx
    xorl    %eax, %eax   |    xorl    %eax, %eax
.LEHB2:                  |.LEHB6:
    call    __printf_chk |    call    __printf_chk
.LEHE2:                  |.LEHE6:
    addq    $16, %rbx    |    addq    $8, %rbx
    cmpq    %rbx, %rbp   |    cmpq    %rbx, %rbp
    jne    .L66          |    jne    .L70

The only relevant difference is that shared_ptr takes larger steps through memory when reading: in both cases the access pattern is linear with a fixed stride, 16 bytes versus 8. I failed to come up with a scenario where there is any measurable difference between the two.

In my opinion, that answers your question.


Some unsolicited advice, just to show you that you are worrying about the wrong thing. Analyze the generated assembly for the two code snippets above (g++ -std=c++11 -Wall -Wextra -pedantic -fwhole-program -O3 -S, then run the assembly through c++filt). You will find that shared_ptr is a heavyweight object compared to unique_ptr: it is expensive to construct and destroy, the atomic reference counts must be maintained, and you pay for the multi-threading machinery as well.

I know it is a questionable metric, but the assembly for shared_ptr is 740 lines, while the assembly for unique_ptr is 402 lines. If you analyze the construction and destruction of the objects, you will notice that significantly more code is executed for shared_ptr, and many of those instructions are more expensive (the multi-threading machinery). shared_ptr also consumes more memory for the extra bookkeeping (in these silly snippets: 144 vs. 36 bytes, i.e. 4 times more with shared_ptr).

Potential cache misses would be the last thing that I would worry about at this point.

answered Oct 11 '22 by Ali