Under what circumstances should I expect memcpys to outperform assignments on modern INTEL/AMD hardware? I am using GCC 4.2.x on a 32 bit Intel platform (but am interested in 64 bit as well).
The function memcpy() is used to copy a memory block from one location to another. One is source and another is destination pointed by the pointer. This is declared in “string.
In C/C++, we can assign a struct (or class in C++ only) variable to another variable of same type.
memcpy is usually naive - certainly not the slowest way to copy memory around, but usually quite easy to beat with some loop unrolling, and you can go even further with assembler.
memcpy is only faster if: BOTH buffers, src AND dst, are 4-byte aligned. if so, memcpy() can copy a 32bit word at a time (inside its own loop over the length) if just one buffer is NOT 32bit word aligned - it creates overhead to figure out and it will do at the end a single char copy loop.
You should never expect them outperform assignments. The reason is, the compiler will use memcpy anyway when it thinks it would be faster (if you use optimize flags). If not and if the structure is reasonable small that it fits into registers, direct register manipulation could be used which wouldn't require any memory access at all.
GCC has special block-move patterns internally that figure out when to directly change registers / memory cells, or when to use the memcpy function. Note when assigning the struct, the compiler knows at compile time how big the move is going to be, so it can unroll small copies (do a move n-times in row instead of looping) for instance. Note -mno-memcpy
:
-mmemcpy -mno-memcpy Force (do not force) the use of "memcpy()" for non-trivial block moves. The default is -mno-memcpy, which allows GCC to inline most constant-sized copies.
Who knows it better when to use memcpy than the compiler itself?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With