With gcc 5.3, both functions in the following example generate a call to memmove. Would it be inappropriate to generate a memcpy?
#include <vector>

int blackhole(const std::vector<int>&);

int copy_vec1(const std::vector<int>& v1) {
    const std::vector<int> v2{v1.begin(), v1.end()};
    return blackhole(v2);
}

int copy_vec2(const std::vector<int>& v1) {
    const auto v2 = v1;
    return blackhole(v2);
}
Example on godbolt.
I tried compiling this code using g++ 6.1.0. I'm not at all certain of the details, but I think that the memmove call is not directly generated by the compiler; rather, it's in the code that implements <vector>.
When I preprocess the code using
/o/apps/gcc-6.1.0/bin/g++ -E -std=c++14 c.cpp
I see two calls to __builtin_memmove, both coming from .../include/c++/6.1.0/bits/stl_algobase.h. Looking at that header file, I see this comment:
// All of these auxiliary structs serve two purposes. (1) Replace
// calls to copy with memmove whenever possible. (Memmove, not memcpy,
// because the input and output ranges are permitted to overlap.)
// (2) If we're using random access iterators, then write the loop as
// a for loop with an explicit count.
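The fast path those structs select boils down to something like the following. This is paraphrased from the libstdc++ header as a standalone sketch, not quoted verbatim:

#include <cstddef>

// Paraphrase of the trivially-copyable fast path in bits/stl_algobase.h
// (around line 382 in GCC 5/6): once std::copy has unwrapped the
// iterators to plain pointers, the whole copy is one builtin call.
template<typename _Tp>
_Tp* __copy_m(const _Tp* __first, const _Tp* __last, _Tp* __result)
{
    const std::ptrdiff_t _Num = __last - __first;
    if (_Num)
        __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
    return __result + _Num;
}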
I think what's happening is that the code invoked for copying the vector is more generally applicable to copies whose ranges can overlap (such as, perhaps, a call to the std::move algorithm).
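For illustration, here is the sort of overlapping copy that std::copy has to support, and the reason the header reaches for memmove (my own example, not from the original code):

#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> v{1, 2, 3, 4, 5};
    // Shift the tail left by one element. The source range [begin()+1, end())
    // overlaps the destination starting at begin(). std::copy permits this
    // because the destination starts before the source, and the trivial-type
    // fast path handles it correctly only because it uses memmove.
    std::copy(v.begin() + 1, v.end(), v.begin());
    // v is now {2, 3, 4, 5, 5}
}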
(I have not confirmed that the memmove calls that appear in the assembly listing correspond to the __builtin_memmove calls in stl_algobase.h. I invite anyone else to follow up on that point.)
Depending on the implementation, memmove() can have some overhead relative to memcpy(), but the difference is minor. It probably just wasn't worth creating special-case code for copies that can't overlap.
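To make that overhead concrete: memmove has to pick a copy direction at run time, which memcpy never does. A toy sketch, illustrative only and nothing like glibc's tuned implementation:

#include <cstddef>

// Illustrative only: memmove must choose a direction so overlapping
// ranges copy correctly; memcpy assumes no overlap and skips the check.
// (A real libc compares raw addresses in ways portable C++ can't.)
void* toy_memmove(void* dst, const void* src, std::size_t n)
{
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    if (d < s)
        for (std::size_t i = 0; i < n; ++i)       // destination below source:
            d[i] = s[i];                          // copy forward
    else
        for (std::size_t i = n; i > 0; --i)       // destination above source:
            d[i - 1] = s[i - 1];                  // copy backward
    return dst;
}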
TL;DR: GCC doesn't optimize the call to memmove inside std::copy. When using two C-style arrays, it does. Replacing &v2[0] with v2.data() allows it to optimize into a memcpy.
Your example is pretty noisy, so let's strip it down:
#include <vector>
#include <algorithm>

int a[5];
int b[5];
std::vector<int> v2(5);  // give v2 elements so writing through &v2[0] is valid
I deliberately put the variables at file scope to prevent optimizing them away without having to deal with volatile semantics.
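For reference, the whole harness looks something like this (my reconstruction, so details like line numbers may not match the GDB sessions below exactly):

// test.cpp -- built with: g++ -std=c++14 -O3 -g -fdump-tree-optimized test.cpp
#include <vector>
#include <algorithm>

int a[5];
int b[5];
std::vector<int> v2(5);

int main()
{
    // the std::copy call under test goes here
}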
First let's try:
std::copy(&a[0], &a[5], &b[0]);
With -O3 -fdump-tree-optimized
this becomes:
__builtin_memcpy (&b[0], &a[0], 20);
Stepping through GDB shows us:
Breakpoint 1, main () at test.cpp:9
9 std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::copy<int*, int*> (__result=0x601080 <b>, __last=0x6010b4, __first=0x6010a0 <a>) at test.cpp:9
9 std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::__copy_move_a2<false, int*, int*> (__result=0x601080 <b>, __last=0x6010b4, __first=0x6010a0 <a>) at test.cpp:9
9 std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::__copy_move_a<false, int*, int*> (__result=<optimized out>, __last=<optimized out>, __first=<optimized out>) at test.cpp:9
9 std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<int> (__result=<optimized out>, __last=<optimized out>,
__first=<optimized out>) at /usr/include/c++/5.3.1/bits/stl_algobase.h:382
382 __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
(gdb) s
main () at test.cpp:10
10 }
Wait, it used memmove?! OK, let's keep going.
What about:
std::copy(&a[0], &a[5], &v2[0]);
OK, that gets us memmove:
int * _2;
<bb 2>:
_2 = MEM[(int * const &)&v2];
__builtin_memmove (_2, &a[0], 20);
Which is reflected in the assembly if we do -S. Stepping through GDB shows us the process:
(gdb)
Breakpoint 1, main () at test.cpp:9
9 {
(gdb) s
10 std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::copy<int*, int*> (__result=<optimized out>, __last=0x6010d4, __first=0x6010c0 <a>) at test.cpp:10
10 std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::__copy_move_a2<false, int*, int*> (__result=<optimized out>, __last=0x6010d4, __first=0x6010c0 <a>) at test.cpp:10
10 std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::__copy_move_a<false, int*, int*> (__result=<optimized out>, __last=<optimized out>, __first=<optimized out>) at test.cpp:10
10 std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<int> (__result=<optimized out>, __last=<optimized out>,
__first=<optimized out>) at /usr/include/c++/5.3.1/bits/stl_algobase.h:382
382 __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
(gdb) s
__memmove_ssse3 () at ../sysdeps/x86_64/multiarch/memcpy-ssse3.S:55
Ah, I see. It's using an optimized memcpy routine provided by the C library. But wait a minute, that doesn't make sense. memmove and memcpy are two different things!
Looking at the source code for this routine, we see little checks sprinkled throughout:
85 #ifndef USE_AS_MEMMOVE
86 cmp %dil, %sil
87 jle L(copy_backward)
88 #endif
GDB confirms that it is treating it as a memmove:
55 mov %rdi, %rax
(gdb) s
61 cmp %rsi, %rdi
(gdb) s
62 jb L(copy_forward)
(gdb) s
63 je L(write_0bytes)
But if we replace &v2[0] with v2.data(), it doesn't call glibc's memmove. So what's going on?
Well, v2.begin() returns an iterator and &v2[0] takes the address of the reference returned by operator[], both of which go through the vector's member machinery, while v2.data() returns a direct pointer to the memory. I think this for some reason prevents GCC from optimizing the memmove into a memcpy. [citation needed]
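For completeness, the variant that does get optimized is (same globals as above):

std::copy(&a[0], &a[5], v2.data());

With -O3 -fdump-tree-optimized this shows up as a __builtin_memcpy instead of the __builtin_memmove we saw earlier.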