I would like to copy a relatively short sequence of memory (less than 1 KB, typically 2-200 bytes) in a time-critical function. The best code for this on the CPU side seems to be rep movsd. However, I cannot get my compiler to generate it. I hoped (and vaguely remember seeing) that memcpy would be expanded by the compiler as a built-in intrinsic, but based on disassembly and debugging, the compiler emits a call to the memcpy/memmove library implementation instead. I also hoped the compiler might be smart enough to recognize the following loop and use rep movsd on its own, but it does not.
char *dst;
const char *src;
// ...
for (int r=size; --r>=0; ) *dst++ = *src++;
Is there some way to make the Visual Studio compiler generate a rep movsd sequence other than using inline assembly?
Several questions come to mind.
First, how do you know movsd would be faster? Have you looked up its latency and throughput? The x86 architecture is full of crufty old instructions that should not be used because they are not very efficient on modern CPUs.
Second, what happens if you use std::copy instead of memcpy? std::copy is potentially faster, as it can be specialized at compile time for the specific data type.
And third, have you enabled intrinsic functions under project properties -> C/C++ -> Optimization?
Of course I assume other optimizations are enabled as well.
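To try the std::copy suggestion, a minimal sketch might look like the following. CopyBytes is a hypothetical helper name, not something from the question, and whether it beats memcpy for a given compiler and size range would have to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: copies `size` bytes from src to dst using std::copy.
// std::copy requires that dst does not point into [src, src + size).
inline void CopyBytes(char *dst, const char *src, std::size_t size)
{
    assert(dst != nullptr && src != nullptr);
    std::copy(src, src + size, dst);
}
```

For trivially copyable element types, standard library implementations commonly forward this call to memmove, so the generated code is worth checking in the disassembly before assuming any difference.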
Are you running an optimised build? It won't use an intrinsic unless optimisation is on. It's also worth noting that it will probably use a better copy loop than rep movsd. It should try to use MMX, at the least, to copy 64 bits at a time. In fact, 6 or 7 years back I wrote an MMX-optimised copy loop for this sort of thing. Unfortunately, the compiler's intrinsic memcpy outperformed my MMX copy by about 1%. That really taught me not to make assumptions about what the compiler is doing.
What I have found meanwhile:
The compiler will use the intrinsic when the copied block size is known at compile time. When it is not, it calls the library implementation. When the size is known, the generated code is very nice and is selected based on the size. It may be a single mov, or movsd, or movsd followed by movsb, as needed.
It seems that if I really want to always use movsb or movsd, even with a "dynamic" size, I will have to use inline assembly or a special intrinsic (see below). I know the size is "quite short", but the compiler does not know it and I cannot communicate this to it. I have even tried __assume(size<16), but it is not enough.
Demo code, compile with -Ob1 (expansion for inline only):
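#include <memory.h>

// Size is only known at run time: the compiler calls the library memcpy.
void MemCpyTest(void *tgt, const void *src, size_t size)
{
    memcpy(tgt, src, size);
}

// Size is a compile-time constant: the compiler expands memcpy inline.
template <int size>
void MemCpyTestT(void *tgt, const void *src)
{
    memcpy(tgt, src, size);
}

int main(int argc, char **argv)
{
    int src = 0; // initialized so the copy reads a defined value
    int dst = 0;
    MemCpyTest(&dst, &src, sizeof(dst));
    MemCpyTestT<sizeof(dst)>(&dst, &src);
    return 0;
}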
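Since the inline expansion only happens for compile-time sizes, one possible workaround is to dispatch a few common run-time sizes to template instances. This is only a sketch: CopyFixed and CopyDispatch are hypothetical names, the list of sizes is an arbitrary assumption, and the switch itself has a cost, so it would need to be profiled against a plain memcpy call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Size is a template parameter, so memcpy sees a compile-time constant here.
template <std::size_t Size>
inline void CopyFixed(void *dst, const void *src)
{
    std::memcpy(dst, src, Size);
}

// Hypothetical dispatcher: the chosen sizes are assumptions for illustration.
inline void CopyDispatch(void *dst, const void *src, std::size_t size)
{
    assert(dst != nullptr && src != nullptr);
    switch (size)
    {
    case 1:  CopyFixed<1>(dst, src);  break;
    case 2:  CopyFixed<2>(dst, src);  break;
    case 4:  CopyFixed<4>(dst, src);  break;
    case 8:  CopyFixed<8>(dst, src);  break;
    case 16: CopyFixed<16>(dst, src); break;
    default: std::memcpy(dst, src, size); break; // library call for other sizes
    }
}
```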
I have recently found that there is a very simple and natural way to make the Visual Studio compiler copy characters using movsd: intrinsics. The following intrinsics may come in handy:
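As a minimal sketch, assuming the __movsb intrinsic from <intrin.h> (which Visual C++ documents as generating a rep movsb sequence on x86 and x64), a wrapper could look like this. CopySmall and HAVE_MOVSB_INTRINSIC are hypothetical names, and the memcpy fallback is there only so the code still builds with other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h> // declares __movsb for MSVC on x86/x64
#define HAVE_MOVSB_INTRINSIC 1
#endif

// Hypothetical helper: forward copy of `size` bytes, buffers must not overlap.
inline void CopySmall(void *dst, const void *src, std::size_t size)
{
    assert(dst != nullptr && src != nullptr);
#ifdef HAVE_MOVSB_INTRINSIC
    // Emitted inline as rep movsb, even when `size` is only known at run time.
    __movsb(static_cast<unsigned char *>(dst),
            static_cast<const unsigned char *>(src), size);
#else
    // Assumption: on other compilers, fall back to the library copy.
    std::memcpy(dst, src, size);
#endif
}
```

Whether this is actually faster than the library memcpy for 2-200 byte blocks depends on the CPU, so it is worth timing both.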