I have this code for memcpy as part of my implementation of the standard C library. It copies memory from src to dest one byte at a time:
    void *memcpy(void *restrict dest, const void *restrict src, size_t len)
    {
        char *dp = (char *restrict)dest;
        const char *sp = (const char *restrict)src;
        while( len-- )
        {
            *dp++ = *sp++;
        }
        return dest;
    }
With gcc -O2, the generated code is reasonable:
    memcpy:
    .LFB0:
            movq    %rdi, %rax
            testq   %rdx, %rdx
            je      .L2
            xorl    %ecx, %ecx
    .L3:
            movzbl  (%rsi,%rcx), %r8d
            movb    %r8b, (%rax,%rcx)
            addq    $1, %rcx
            cmpq    %rdx, %rcx
            jne     .L3
    .L2:
            ret
    .LFE0:
However, at gcc -O3, GCC optimizes this naive byte-for-byte copy into a memcpy call:
    memcpy:
    .LFB0:
            testq   %rdx, %rdx
            je      .L7
            subq    $8, %rsp
            call    memcpy
            addq    $8, %rsp
            ret
    .L7:
            movq    %rdi, %rax
            ret
    .LFE0:
This won't work (memcpy unconditionally calls itself), and it causes a segfault.
I've tried passing -fno-builtin-memcpy and -fno-loop-optimizations, and the same thing occurs.
I'm using GCC version 8.3.0:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-cros-linux-gnu/8.3.0/lto-wrapper
Target: x86_64-cros-linux-gnu
Configured with: ../configure --prefix=/usr/local --libdir=/usr/local/lib64 --build=x86_64-cros-linux-gnu --host=x86_64-cros-linux-gnu --target=x86_64-cros-linux-gnu --enable-checking=release --disable-multilib --enable-threads=posix --disable-bootstrap --disable-werror --disable-libmpx --enable-static --enable-shared --program-suffix=-8.3.0 --with-arch-64=x86-64
Thread model: posix
gcc version 8.3.0 (GCC)
How do I disable the optimization that causes the copy to be transformed into a memcpy call?
One thing that seems to be sufficient here: instead of using -fno-builtin-memcpy, use -fno-builtin when compiling the translation unit that contains memcpy, and only that one!
An alternative would be to pass -fno-tree-loop-distribute-patterns, though this might be brittle, as it forbids the compiler from first reorganizing the loop code and then replacing parts of it with calls to mem* functions.
Or, since you cannot rely on anything in the C library, perhaps using -ffreestanding could be in order.
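If changing the flags for the whole translation unit is inconvenient, the same option can also be attached to a single function through GCC's optimize function attribute. The following is only a minimal sketch under assumptions: it targets GCC, the macro NO_LOOP_TO_LIBCALL and the function byte_copy are hypothetical names, and byte_copy stands in for memcpy so that the example can be linked against a hosted C library. GCC's documentation describes the optimize attribute as a debugging aid, so the command-line flags above remain the more conservative choice.

```c
#include <stddef.h>

/* Hypothetical helper macro: asks GCC to compile one function as if
   -fno-tree-loop-distribute-patterns had been passed. Clang does not
   implement the optimize attribute, so the macro is empty there. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL \
    __attribute__((optimize("no-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL
#endif

/* byte_copy is a stand-in name (an assumption for this sketch);
   inside the real library this function would be memcpy itself. */
NO_LOOP_TO_LIBCALL
void *byte_copy(void *restrict dest, const void *restrict src, size_t len)
{
    char *dp = dest;
    const char *sp = src;

    /* Plain byte-for-byte copy, as in the question. */
    while (len--) {
        *dp++ = *sp++;
    }
    return dest;
}
```

Whichever variant is used, it is worth checking the result with gcc -O3 -S and confirming that the loop body no longer contains a call to memcpy.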
This won't work (memcpy unconditionally calls itself), and it causes a segfault.
Redefining memcpy is undefined behavior.
How do I disable the optimization that causes the copy to be transformed into a memcpy call (preferably while still compiling with -O3)?
Don't. The best approach is fixing your code instead:
In most cases, you should use another name.
In the rare case where you really are implementing a C library (as discussed in the comments) and really want to reimplement memcpy, you should use compiler-specific options to achieve that. For GCC, see -fno-builtin* and -ffreestanding, as well as -nodefaultlibs and -nostdlib.
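As a small illustration of the -ffreestanding point, the sketch below shows one way a library source file could detect which mode it was compiled in. It relies on the standard __STDC_HOSTED__ macro, which is 0 in a freestanding implementation (GCC sets it that way under -ffreestanding) and 1 in a hosted one. The function name lib_is_freestanding is hypothetical, and the file includes only <stddef.h>, one of the headers the C standard guarantees in freestanding mode.

```c
#include <stddef.h> /* guaranteed to exist even in freestanding mode */

/* Hypothetical helper: returns 1 if this translation unit was compiled
   as a freestanding implementation (e.g. gcc -ffreestanding), else 0. */
int lib_is_freestanding(void)
{
#if defined(__STDC_HOSTED__) && __STDC_HOSTED__ == 0
    return 1;
#else
    return 0;
#endif
}
```

Note that GCC's manual states that even a freestanding environment is expected to provide memcpy, memmove, memset and memcmp, because the compiler may still emit calls to them. Compiling in freestanding mode is therefore not, by itself, a guarantee that no memcpy call is generated, and the assembly output is still worth inspecting.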