I understand a good compiler can perform optimization so that accessing, say, an int* via a std::shared_ptr is done using the same assembly as if a raw int* were used.
My question is: would the cache line containing the optimized smart pointer be polluted with other data members of the smart pointer, such as the reference counter? So that, although the generated assembly would be identical to that for a raw pointer, cache performance could be worse because less of the cache line is being used efficiently?
EDIT: This performance effect could be more noticeable if we were iterating through a structure like std::vector<std::shared_ptr<int>> and using the ints.
Use unique_ptr when you want a single pointer to an object that will be reclaimed when that single pointer is destroyed. Use shared_ptr when you want multiple pointers to the same resource.
An object referenced by the contained raw pointer will not be destroyed as long as the reference count is greater than zero, i.e., until all copies of the shared_ptr have been destroyed. So, we should use shared_ptr when we want multiple owners referring to the same managed object.
By moving the shared_ptr instead of copying it, we "steal" the reference count and nullify the other shared_ptr. "Stealing" the reference count requires no atomic operation, so a move is much faster than copying the shared_ptr (which causes an atomic reference-count increment).
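A minimal sketch of the difference, assuming a typical reference-counted implementation:
#include <memory>
#include <utility>

int main() {
    std::shared_ptr<int> sp = std::make_shared<int>(42);
    std::shared_ptr<int> c = sp;             // copy: atomic increment of the reference count
    std::shared_ptr<int> m = std::move(sp);  // move: the pointers are stolen, sp becomes null, no atomic op
}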
Smart pointers are class objects that behave like raw pointers but manage objects created with new, deciding when or whether to delete them: a smart pointer automatically deletes the managed object at the appropriate time.
There are several things to consider, but in summary one can say: It doesn't really matter.
First, there is no guarantee (or rather, no requirement) that there is a reference counter at all. The standard merely specifies how std::shared_ptr has to behave. Using a reference counter is one way of achieving this; a circular linked list is another.
Practically, all implementations (all that I know of, at least) do use a reference counter.
Second, the reference counter can be allocated separately via operator new, or in the same location, using the same allocation as the managed object (which is then created via placement new), if you use make_shared. The latter is, again, not strictly guaranteed. The standard states "Implementations are encouraged, but not required, to perform no more than one memory allocation", explicitly allowing something different.
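To make that concrete, here is a minimal sketch of the two allocation strategies (whether make_shared actually coalesces the two allocations is up to the implementation, as the quote above says):
#include <memory>

int main() {
    // Typically two allocations: one for the int, one for the control block.
    std::shared_ptr<int> a(new int(1));
    // Typically a single allocation holding both the int and the control block.
    auto b = std::make_shared<int>(2);
}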
If the reference counter is separately allocated, it will most likely live in a different cache line, and so the system will consume two cache lines rather than one when accessing the object. In some situations, however, that is an advantage, not a disadvantage (namely, whenever you copy the smart pointer; see below).
In addition to the data cache, there is also the TLB, which is much smaller (usually fewer than 64 entries). Given that there is a certain likelihood that two separately allocated objects will land not only in different cache lines but in different memory pages, this may be another thing that costs a few extra cycles.
If the reference counter is allocated in the same location, it will very likely be in the same cache line as the beginning of the object (but it might be in the previous cache line as well). This looks like an advantage, but it is not necessarily one.
Whenever the shared_ptr is copied, or one smart pointer referencing the same object goes out of scope, the reference counter must be modified. This is a write to the cache line the counter lives in (or rather, since it is an atomic operation, not a plain write, but the net effect to the outside world is the same). The cache line is invalidated and must be fetched again by everybody else wanting to access it.
There is a well-known problem called "false sharing", where naive parallel processing runs much slower than anyone would expect, and it occurs for exactly this reason.
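A minimal sketch of the effect (the 64-byte cache line size is an assumption about typical hardware):
#include <atomic>

// Two counters packed into the same cache line: threads incrementing a and b
// concurrently invalidate each other's copy of the line on every write.
struct Crowded {
    std::atomic<int> a;
    std::atomic<int> b;
};

// Padding each counter to its own (assumed 64-byte) cache line avoids the interference.
struct Padded {
    alignas(64) std::atomic<int> a;
    alignas(64) std::atomic<int> b;
};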
Now if the reference counter is allocated together with the object, this means that whenever the shared pointer is copied (or goes out of scope) the next access to the object is a guaranteed cache miss (since the cache line containing the start of the object will be purged from cache).
Yes, there are cache effects, but you should not worry too much. Cache misses happen all the time anyway (the same goes for the TLB). As long as you have at least a somewhat coherent access pattern, this will not matter at all.
After the next context switch (that is, every few milliseconds, or after the next interrupt or syscall) your cache and TLB will likely be gone anyway. That's something everyone has to live with, and it's not a problem at all.
You do not use a shared_ptr for fun, but because it provides valuable functionality that you need. Copying the pointer might be 3-4 cycles slower than copying a raw pointer, and it might cause an occasional extra cache miss, but you don't make a hundred thousand copies per second.
The safety and overall utility benefits far outweigh the disadvantages.
I have good reasons to believe that you are worrying about the wrong thing.
Consider the following two snippets:
#include <cstdio>
#include <memory>
#include <vector>
int main() {
    std::vector<std::shared_ptr<int>> v = { std::make_shared<int>(1),
                                            std::make_shared<int>(2),
                                            std::make_shared<int>(3) };
    for (auto& e : v)
        std::printf("%d\n", *e);
}
and the same functionality but with unique_ptr:
#include <cstdio>
#include <memory>
#include <vector>
// C++11 has no std::make_unique (it was added in C++14), hence this local helper.
template<typename T, typename... Args>
std::unique_ptr<T> make_unique(Args&&... args) {
    return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}
int main() {
    std::vector<std::unique_ptr<int>> v;
    v.reserve(3);
    v.emplace_back(make_unique<int>(1));
    v.emplace_back(make_unique<int>(2));
    v.emplace_back(make_unique<int>(3));
    for (auto& e : v)
        std::printf("%d\n", *e);
}
Just to answer your question:
This performance effect could be more noticeable if we were iterating through a structure like std::vector<std::shared_ptr<int>> and using the ints.
Here are the corresponding loops:
shared_ptr version | unique_ptr version
.L66: | .L70:
movq (%rbx), %rax | movq (%rbx), %rax
movl $.LC0, %esi | movl $.LC0, %esi
movl $1, %edi | movl $1, %edi
movl (%rax), %edx | movl (%rax), %edx
xorl %eax, %eax | xorl %eax, %eax
.LEHB2: | .LEHB6:
call __printf_chk | call __printf_chk
.LEHE2: | .LEHE6:
addq $16, %rbx | addq $8, %rbx
cmpq %rbx, %rbp | cmpq %rbx, %rbp
jne .L66 | jne .L70
The only relevant difference is that you are doing larger steps with shared_ptr when reading the memory: in both cases the access pattern is linear with a fixed stride, but the stride is 16 bytes instead of 8 (see the addq increments). I failed to come up with a scenario where there is any measurable difference between the two.
In my opinion, that answers your question.
Some unsolicited advice, just to show you that you are worrying about the wrong thing: analyze the generated assembly code for the two code snippets above (g++ -std=c++11 -Wall -Wextra -pedantic -fwhole-program -O3 -S, then run the assembly through c++filt). You will find that shared_ptr is a heavyweight object compared to unique_ptr. You will see that it is very expensive to construct and destroy shared_ptrs and to maintain the atomic reference counts; you pay for the heavyweight multi-threading machinery as well.
I know it is a questionable metric, but the assembly code for shared_ptr is 740 lines, and the code for unique_ptr is 402 lines. If you analyze the construction and destruction of the objects, you will notice that significantly more code is executed for shared_ptr, and many of those instructions are more expensive (expensive multi-threading stuff). shared_ptr also consumes more memory for the extra bookkeeping (in these silly code snippets: 144 vs. 36 bytes, that is, 4 times more with shared_ptr).
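If you want a quick feel for the per-pointer overhead on your own platform, here is a minimal sketch (the printed sizes are implementation-specific; on a typical 64-bit implementation expect 16 vs. 8 bytes, since shared_ptr stores an extra pointer to its control block):
#include <cstdio>
#include <memory>

int main() {
    // shared_ptr typically holds two pointers (object + control block), unique_ptr just one.
    std::printf("%zu %zu\n", sizeof(std::shared_ptr<int>),
                             sizeof(std::unique_ptr<int>));
}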
Potential cache misses would be the last thing that I would worry about at this point.