
Performance of resizing std::vector<std::unique_ptr<T>>

The general conception seems to be that std::unique_ptr has no time overhead compared to properly used owning raw pointers, given sufficient optimization.

But what about using std::unique_ptr in compound data structures, in particular std::vector<std::unique_ptr<T>>? For instance, resizing the underlying data of a vector, which can happen during push_back. To isolate the performance, I loop around pop_back, shrink_to_fit, emplace_back:

```cpp
#include <chrono>
#include <vector>
#include <memory>
#include <iostream>

constexpr size_t size = 1000000;
constexpr size_t repeat = 1000;
using my_clock = std::chrono::high_resolution_clock;

template<class T>
auto test(std::vector<T>& v) {
    v.reserve(size);
    for (size_t i = 0; i < size; i++) {
        v.emplace_back(new int());
    }
    auto t0 = my_clock::now();
    for (int i = 0; i < repeat; i++) {
        auto back = std::move(v.back());
        v.pop_back();
        v.shrink_to_fit();
        if (back == nullptr) throw "don't optimize me away";
        v.emplace_back(std::move(back));
    }
    return my_clock::now() - t0;
}

int main() {
    std::vector<std::unique_ptr<int>> v_u;
    std::vector<int*> v_p;

    auto millis_p = std::chrono::duration_cast<std::chrono::milliseconds>(test(v_p));
    auto millis_u = std::chrono::duration_cast<std::chrono::milliseconds>(test(v_u));
    std::cout << "raw pointer: " << millis_p.count() << " ms, unique_ptr: "
              << millis_u.count() << " ms\n";
    for (auto p : v_p) delete p; // I don't like memory leaks ;-)
}
```

Compiling the code with `-O3 -march=native -std=c++14 -g` under gcc 7.1.0, clang 3.8.0, and icc 17.0.4 on Linux, on an Intel Xeon E5-2690 v3 @ 2.6 GHz (no turbo):

```
raw pointer: 2746 ms, unique_ptr: 5140 ms  (gcc)
raw pointer: 2667 ms, unique_ptr: 5529 ms  (clang)
raw pointer: 1448 ms, unique_ptr: 5374 ms  (intel)
```

The raw pointer version spends all its time in an optimized memmove (Intel's appears to be much better than clang's and gcc's). The unique_ptr code seems to first copy the vector data from one memory block to the other and zero out the original, all in a horribly unoptimized loop. It then loops over the original block of data again to check whether any of the just-zeroed pointers are nonzero and need to be deleted. The full gory detail can be seen on godbolt. The question is not how the compiled code differs; that is pretty clear. The question is why the compiler fails to optimize what is generally regarded as a no-extra-overhead abstraction.

Trying to understand how the compilers reason about handling std::unique_ptr, I was looking a bit more at isolated code. For instance:

```cpp
void foo(std::unique_ptr<int>& a, std::unique_ptr<int>& b) {
    a.release();
    a = std::move(b);
}
```

or the similar

```cpp
a.release();
a.reset(b.release());
```

none of the x86 compilers seems able to optimize away the senseless `if (ptr) delete ptr;`. The Intel compiler even gives the delete a 28% chance. Surprisingly, the delete check is consistently omitted for:

```cpp
auto tmp = b.release();
a.release();
a.reset(tmp);
```

These bits are not the main aspect of this question, but all of this makes me feel that I am missing something.

Why do various compilers fail to optimize reallocation within std::vector<std::unique_ptr<int>>? Is there anything in the standard that prevents generating code as efficient as with raw pointers? Is this an issue with the standard library implementation? Or are the compilers just not sufficiently clever (yet)?

What can one do to avoid performance impact compared to using raw pointers?

Note: Assume that T is polymorphic and expensive to move, so std::vector<T> is not an option.

Zulan asked Jul 13 '17



1 Answer

The claim that unique_ptr performs as well as a raw pointer after optimization mostly applies only to the basic operations on a single pointer: creation, dereferencing, assignment, and deletion. Those operations are defined simply enough that an optimizing compiler can usually make the required transformations such that the resulting code is equivalent (or nearly so) in performance to the raw version[0].

One place this falls apart is higher-level, library-based optimizations on array-based containers such as std::vector, as you have noted with your test. These containers usually apply source-level optimizations that depend on type traits to determine at compile time whether a type can safely be copied using a byte-wise copy such as memcpy; if so, they delegate to such a method, and otherwise they fall back to an element-wise copy loop.

To be safely copyable with memcpy, an object must be trivially copyable. std::unique_ptr is not trivially copyable, since it fails several of the requirements, such as having only trivial or deleted copy and move constructors. The exact mechanism depends on the standard library involved, but in general a quality std::vector implementation will end up calling a specialized form of something like std::uninitialized_copy for trivially copyable types that just delegates to memmove.

The typical implementation details are quite tortured, but for libstdc++ (used by gcc) you can see the high-level divergence in std::uninitialized_copy:

```cpp
template<typename _InputIterator, typename _ForwardIterator>
  inline _ForwardIterator
  uninitialized_copy(_InputIterator __first, _InputIterator __last,
                     _ForwardIterator __result)
  {
    ...
    return std::__uninitialized_copy<__is_trivial(_ValueType1)
                                     && __is_trivial(_ValueType2)
                                     && __assignable>::
      __uninit_copy(__first, __last, __result);
  }
```

From there you can take my word that many of the std::vector "movement" methods end up here, and that __uninitialized_copy<true>::__uninit_copy(...) ultimately calls memmove while the <false> version doesn't - or you can trace through the code yourself (but you already saw the result in your benchmark).

Ultimately then, you end up with several loops that perform the required copy steps for non-trivial objects, such as calling the move constructor of the destination object, and subsequently calling the destructor of all the source objects. These are separate loops, and even modern compilers will pretty much not be able to reason about something like "OK, in the first loop I moved all the destination objects, so their ptr member will be null, so the second loop is a no-op". Finally, to equal the speed of raw pointers, not only would compilers need to optimize across these two loops, they would need a transformation which recognizes that the whole thing can be replaced by memcpy or memmove[2].

So one answer to your question is that compilers just aren't smart enough to do this optimization, but it's largely because the "raw" version has a lot of compile-time help to skip the need for this optimization entirely.

Loop Fusion

As mentioned the existing vector implementations implement a resize-type operation in two separate loops (in addition to non-loop work such as allocating the new storage and freeing the old storage):

  • Copying the source objects into the newly allocated destination array (conceptually using something like placement new calling the move constructor).
  • Destroying the source objects in the old region.

Conceptually you could imagine an alternative: doing it all in one loop, copying each element and then immediately destroying it. It is possible that a compiler could even notice that the two loops iterate over the same set of values and fuse them into one. Apparently, however, gcc doesn't do any loop fusion today (https://gcc.gnu.org/ml/gcc/2015-04/msg00291.html), and neither do clang or icc if you believe this test.

So then we are left trying to put the loops together explicitly at the source level.

Now the two-loop implementation helps preserve the exception-safety contract of the operation by not destroying any source objects until we know the construction part of the copy has completed, but it also helps to optimize the copy and destruction when we have trivially copyable and trivially destructible objects, respectively. In particular, with simple trait-based selection we can replace the copy with a memmove, and the destruction loop can be elided entirely[3].

So the two-loop approach helps when those optimizations apply, but it actually hurts in the general case of objects which are neither trivially copyable nor trivially destructible. It means you need two passes over the objects, and you lose the opportunity to optimize and eliminate code between the copy of an object and its subsequent destruction. In the unique_ptr case you lose the ability for the compiler to propagate the knowledge that the source unique_ptr will have a null internal ptr member and hence skip the if (ptr) delete ptr check entirely[4].

Trivially Movable

Now one might ask whether we could apply the same type-traits compile-time optimization to the unique_ptr case. For example, one might look at the trivially-copyable requirements and see that they are perhaps too strict for the common move operations in std::vector. Sure, a unique_ptr is evidently not trivially copyable, since a bit-wise copy would leave both the source and destination object owning the same pointer (and result in double-deletion), but it seems that it should be bit-wise movable: if you move a unique_ptr from one area of memory to another, such that you no longer consider the source a live object (and hence won't call its destructor), it should "just work" for the typical unique_ptr implementation.

Unfortunately, no such "trivial move" concept exists, although you could try to roll your own. There seems to be an open debate about whether this is UB or not for objects that can be byte-wise copied and do not depend on their constructor or destructor behavior in the move scenario.

You could always implement your own trivially-movable concept, which would be something like (a) the object has a trivial move constructor and (b) when used as the source argument of the move constructor, the object is left in a state where its destructor has no effect. Note that such a definition is currently mostly useless, since "trivial move constructor" (basically element-wise copy and nothing else) is not consistent with any modification of the source object. So for example, a trivial move constructor cannot set the ptr member of the source unique_ptr to zero. So you'd need to jump through some more hoops, such as introducing the concept of a destructive move operation which leaves the source object destroyed, rather than in a valid-but-unspecified state.

You can find some more detailed discussion of this "trivially movable" concept in this thread on the ISO C++ usenet discussion group. In particular, in the linked reply, the exact issue of vectors of unique_ptr is addressed:

It turns out many smart pointers (unique_ptr and shared_ptr included) fall into all three of those categories and by applying them you can have vectors of smart pointers with essentially zero overhead over raw pointers even in non-optimized debug builds.

See also the relocator proposal.


[0] Although the non-vector examples at the end of your question show that this isn't always the case. Here it is due to possible aliasing, as zneak explains in his answer. Raw pointers avoid many of these aliasing issues since they lack the indirection that unique_ptr has (e.g., you pass a raw pointer by value, rather than a structure with a pointer by reference) and can often omit the if (ptr) delete ptr check entirely.

[2] This is actually harder than you might think, because memmove, for example, has subtly different semantics than an object copy loop when the source and destination overlap. Of course the high-level type-traits code that works for raw pointers knows (by contract) that there is no overlap, or that the behavior of memmove is consistent even if there is overlap, but proving the same thing at some later arbitrary optimization pass may be much harder.

[3] It is important to note that these optimizations are more or less independent. For example, many objects are trivially destructible that are not trivially copyable.

[4] Although in my test neither gcc nor clang was able to suppress the check, even with __restrict__ applied, apparently due to insufficiently powerful aliasing analysis, or perhaps because std::move strips the "restrict" qualifier somehow.

BeeOnRope answered Sep 28 '22