With gcc 5.3, both functions in the following example generate a call to memmove. Would it be inappropriate to generate a memcpy?
#include <vector>

int blackhole(const std::vector<int>&);

int copy_vec1(const std::vector<int>& v1) {
    const std::vector<int> v2{v1.begin(), v1.end()};
    return blackhole(v2);
}

int copy_vec2(const std::vector<int>& v1) {
    const auto v2 = v1;
    return blackhole(v2);
}
Example on godbolt.
I tried compiling this code using g++ 6.1.0. I'm not at all certain of the details, but I think that the memmove call is not directly generated by the compiler; rather, it's in the code that implements <vector>.
When I preprocess the code using
/o/apps/gcc-6.1.0/bin/g++ -E -std=c++14 c.cpp
I see two calls to __builtin_memmove, both coming from .../include/c++/6.1.0/bits/stl_algobase.h. Looking at that header file, I see this comment:
// All of these auxiliary structs serve two purposes. (1) Replace
// calls to copy with memmove whenever possible. (Memmove, not memcpy,
// because the input and output ranges are permitted to overlap.)
// (2) If we're using random access iterators, then write the loop as
// a for loop with an explicit count.
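The fast path those structs select boils down to something like the following. This is paraphrased from the libstdc++ header as a standalone sketch, not quoted verbatim:

#include <cstddef>

// Paraphrase of the trivially-copyable fast path in bits/stl_algobase.h
// (around line 382 in GCC 5/6): once std::copy has unwrapped the
// iterators to plain pointers, the whole copy is one builtin call.
template<typename _Tp>
_Tp* __copy_m(const _Tp* __first, const _Tp* __last, _Tp* __result)
{
    const std::ptrdiff_t _Num = __last - __first;
    if (_Num)
        __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
    return __result + _Num;
}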
I think what's happening is that the code invoked for copying the vector is more generally applicable to copies whose ranges can overlap (such as, perhaps, a call to the std::move algorithm).
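For illustration, here is the sort of overlapping copy that std::copy has to support, and the reason the header reaches for memmove (my own example, not from the original code):

#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> v{1, 2, 3, 4, 5};
    // Shift the tail left by one element. The source range [begin()+1, end())
    // overlaps the destination starting at begin(). std::copy permits this
    // because the destination starts before the source, and the trivial-type
    // fast path handles it correctly only because it uses memmove.
    std::copy(v.begin() + 1, v.end(), v.begin());
    // v is now {2, 3, 4, 5, 5}
}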
(I have not confirmed that the memmove calls that appear in the assembly listing correspond to the __builtin_memmove calls in stl_algobase.h. I invite anyone else to follow up on that point.)
Depending on the implementation, memmove() can have some overhead relative to memcpy(), but the difference is minor. It probably just wasn't worth creating special-case code for copies that can't overlap.
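To make that overhead concrete: memmove has to pick a copy direction at run time, which memcpy never does. A toy sketch, illustrative only and nothing like glibc's tuned implementation:

#include <cstddef>

// Illustrative only: memmove must choose a direction so overlapping
// ranges copy correctly; memcpy assumes no overlap and skips the check.
// (A real libc compares raw addresses in ways portable C++ can't.)
void* toy_memmove(void* dst, const void* src, std::size_t n)
{
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    if (d < s)
        for (std::size_t i = 0; i < n; ++i)       // destination below source:
            d[i] = s[i];                          // copy forward
    else
        for (std::size_t i = n; i > 0; --i)       // destination above source:
            d[i - 1] = s[i - 1];                  // copy backward
    return dst;
}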
TL;DR: GCC doesn't optimize the call to memmove inside std::copy. When using two C-style arrays, it does. Replacing &v2[0] with v2.data() allows it to optimize into a memcpy.
Your example is pretty noisy, so let's strip it down:
#include <vector>
#include <algorithm>

int a[5];
int b[5];
std::vector<int> v2(5);  // give v2 elements so writing through &v2[0] is valid
I deliberately put the variables at file scope to prevent optimizing them away without having to deal with volatile semantics.
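For reference, the whole harness looks something like this (my reconstruction, so details like line numbers may not match the GDB sessions below exactly):

// test.cpp -- built with: g++ -std=c++14 -O3 -g -fdump-tree-optimized test.cpp
#include <vector>
#include <algorithm>

int a[5];
int b[5];
std::vector<int> v2(5);

int main()
{
    // the std::copy call under test goes here
}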
First let's try:
std::copy(&a[0], &a[5], &b[0]);
With -O3 -fdump-tree-optimized
this becomes:
__builtin_memcpy (&b[0], &a[0], 20);
Stepping through GDB shows us:
Breakpoint 1, main () at test.cpp:9
9 std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::copy<int*, int*> (__result=0x601080 <b>, __last=0x6010b4, __first=0x6010a0 <a>) at test.cpp:9
9 std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::__copy_move_a2<false, int*, int*> (__result=0x601080 <b>, __last=0x6010b4, __first=0x6010a0 <a>) at test.cpp:9
9 std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::__copy_move_a<false, int*, int*> (__result=<optimized out>, __last=<optimized out>, __first=<optimized out>) at test.cpp:9
9 std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<int> (__result=<optimized out>, __last=<optimized out>,
__first=<optimized out>) at /usr/include/c++/5.3.1/bits/stl_algobase.h:382
382 __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
(gdb) s
main () at test.cpp:10
10 }
Wait, it used memmove?! OK, let's keep going.
What about:
std::copy(&a[0], &a[5], &v2[0]);
OK, that gets us memmove:
int * _2;
<bb 2>:
_2 = MEM[(int * const &)&v2];
__builtin_memmove (_2, &a[0], 20);
Which is reflected in the assembly if we do -S. Stepping through GDB shows us the process:
(gdb)
Breakpoint 1, main () at test.cpp:9
9 {
(gdb) s
10 std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::copy<int*, int*> (__result=<optimized out>, __last=0x6010d4, __first=0x6010c0 <a>) at test.cpp:10
10 std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::__copy_move_a2<false, int*, int*> (__result=<optimized out>, __last=0x6010d4, __first=0x6010c0 <a>) at test.cpp:10
10 std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::__copy_move_a<false, int*, int*> (__result=<optimized out>, __last=<optimized out>, __first=<optimized out>) at test.cpp:10
10 std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<int> (__result=<optimized out>, __last=<optimized out>,
__first=<optimized out>) at /usr/include/c++/5.3.1/bits/stl_algobase.h:382
382 __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
(gdb) s
__memmove_ssse3 () at ../sysdeps/x86_64/multiarch/memcpy-ssse3.S:55
Ah, I see. It's using an optimized memcpy routine provided by the C library. But wait a minute, that doesn't make sense. memmove and memcpy are two different things!
Looking at the source code for this routine, we see little checks sprinkled throughout:
85 #ifndef USE_AS_MEMMOVE
86 cmp %dil, %sil
87 jle L(copy_backward)
88 #endif
GDB confirms that it is treating it as a memmove:
55 mov %rdi, %rax
(gdb) s
61 cmp %rsi, %rdi
(gdb) s
62 jb L(copy_forward)
(gdb) s
63 je L(write_0bytes)
But if we replace &v2[0] with v2.data(), it doesn't call glibc's memmove. So what's going on?
Well, v2.begin() returns an iterator and &v2[0] takes the address of the reference returned by operator[], both of which go through the vector's member machinery, while v2.data() returns a direct pointer to the memory. I think this for some reason prevents GCC from optimizing the memmove into a memcpy. [citation needed]
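For completeness, the variant that does get optimized is (same globals as above):

std::copy(&a[0], &a[5], v2.data());

With -O3 -fdump-tree-optimized this shows up as a __builtin_memcpy instead of the __builtin_memmove we saw earlier.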