I have this code for memcpy as part of my implementation of the standard C library which copies memory from src to dest one byte at a time:
void *memcpy(void *restrict dest, const void *restrict src, size_t len)
{
    char *dp = (char *restrict)dest;
    const char *sp = (const char *restrict)src;
    while( len-- )
    {
        *dp++ = *sp++;
    }
    return dest;
}
With gcc -O2, the code generated is reasonable:
memcpy:
.LFB0:
movq %rdi, %rax
testq %rdx, %rdx
je .L2
xorl %ecx, %ecx
.L3:
movzbl (%rsi,%rcx), %r8d
movb %r8b, (%rax,%rcx)
addq $1, %rcx
cmpq %rdx, %rcx
jne .L3
.L2:
ret
.LFE0:
However, at gcc -O3, GCC optimizes this naive byte-for-byte copy into a memcpy call:
memcpy:
.LFB0:
testq %rdx, %rdx
je .L7
subq $8, %rsp
call memcpy
addq $8, %rsp
ret
.L7:
movq %rdi, %rax
ret
.LFE0:
This won't work (memcpy unconditionally calls itself), and it causes a segfault.
I've tried passing -fno-builtin-memcpy and -fno-loop-optimizations, and the same thing occurs.
I'm using GCC version 8.3.0:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-cros-linux-gnu/8.3.0/lto-wrapper
Target: x86_64-cros-linux-gnu
Configured with: ../configure --prefix=/usr/local --libdir=/usr/local/lib64 --build=x86_64-cros-linux-gnu --host=x86_64-cros-linux-gnu --target=x86_64-cros-linux-gnu --enable-checking=release --disable-multilib --enable-threads=posix --disable-bootstrap --disable-werror --disable-libmpx --enable-static --enable-shared --program-suffix=-8.3.0 --with-arch-64=x86-64
Thread model: posix
gcc version 8.3.0 (GCC)
How do I disable the optimization that causes the copy to be transformed into a memcpy call?
With a cold cache, optimized memcpy with write-back cache works best because the cache doesn't have to write to memory and so avoids any delays on the bus. For a garbage-filled cache, write-through caches work slightly better, because the cache doesn't need to spend extra cycles evicting irrelevant data to memory.
The memcpy() function copies count bytes from src to dest. The behavior is undefined if copying takes place between objects that overlap. The memmove() function allows copying between objects that might overlap.
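To make the overlap rule concrete, here is a minimal sketch. The helper name shift_right is hypothetical and only serves the illustration; it moves a prefix of a buffer to a later position in the same buffer, which is exactly the overlapping case where memmove is needed and memcpy would be undefined.

```c
#include <string.h>

/* Hypothetical helper: copy the first n bytes of buf to buf + shift.
 * Source and destination overlap whenever shift < n, so memmove is
 * required; calling memcpy here would be undefined behavior.
 * The caller must guarantee that buf holds at least n + shift bytes. */
static void shift_right(char *buf, size_t n, size_t shift)
{
    memmove(buf + shift, buf, n);
}
```

With buf holding "abcdef", shift_right(buf, 4, 2) copies "abcd" over positions 2 to 5 and leaves "ababcd".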
Some versions of GCC and ICC tend to leave variables touched by inline assembly intact. This gives a second technique for preventing optimizations. For example, Facebook's Folly library uses a doNotOptimizeAway function to prevent an expression from being optimized away.
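The inline-assembly idea can be sketched in C roughly as follows. This is not Folly's code: the macro DO_NOT_OPTIMIZE_AWAY and the function name copy_bytes are hypothetical, and the sketch assumes a compiler with GCC-style extended asm. The empty asm statement tells the compiler that the byte is read and possibly modified, which in practice tends to stop the loop from being recognized as a memcpy pattern, although nothing guarantees this across compiler versions.

```c
#include <stddef.h>

/* Hypothetical macro (GCC-style extended asm assumed): an empty asm
 * statement that claims to read and modify x, so the compiler has to
 * treat x as live and cannot reason about its value across this point. */
#define DO_NOT_OPTIMIZE_AWAY(x) __asm__ volatile("" : "+r"(x))

/* Byte-at-a-time copy under a hypothetical name. Each byte passes
 * through the empty asm statement on its way from src to dest. */
void *copy_bytes(void *restrict dest, const void *restrict src, size_t len)
{
    char *dp = dest;
    const char *sp = src;
    while (len--) {
        char c = *sp++;
        DO_NOT_OPTIMIZE_AWAY(c);
        *dp++ = c;
    }
    return dest;
}
```

The trade-off is that the asm statement also blocks useful optimizations such as vectorization, so this is a blunt tool compared with a compiler flag.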
Cross-compiler vendors generally include a precompiled set of standard class libraries, including a basic implementation of memcpy(). Unfortunately, since this same code must run on hardware with a variety of processors and memory architectures, it can't be optimized for any specific architecture.
One thing that seems to be sufficient here: instead of using -fno-builtin-memcpy use -fno-builtin for compiling the translation unit of memcpy alone!
An alternative would be to pass -fno-tree-loop-distribute-patterns, though this might be brittle, since it forbids the compiler from first reorganizing the loop code and then replacing parts of it with calls to mem* functions.
Or, since you cannot rely on anything in the C library, perhaps using -ffreestanding could be in order.
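If changing the flags for a whole translation unit is inconvenient, the loop-distribution option can also be scoped to one function with GCC's optimize function attribute. The sketch below assumes GCC; the macro name NO_LOOP_TO_LIBCALL and the function name my_memcpy are hypothetical, and the function is renamed only so the example can be built inside an ordinary hosted program. Note that GCC's documentation describes the optimize attribute as intended mainly for debugging, so treat this as a sketch rather than a recommendation.

```c
#include <stddef.h>

/* GCC-only: apply -fno-tree-loop-distribute-patterns to one function.
 * Clang does not implement the optimize attribute, so the macro expands
 * to nothing there (an assumption made to keep the sketch portable). */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL \
    __attribute__((optimize("-fno-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL
#endif

/* Hypothetical name; a C library would call this memcpy. With the
 * attribute applied, GCC should keep the byte loop at -O3 instead of
 * replacing it with a call to memcpy. */
NO_LOOP_TO_LIBCALL
void *my_memcpy(void *restrict dest, const void *restrict src, size_t len)
{
    char *dp = dest;
    const char *sp = src;
    while (len--) {
        *dp++ = *sp++;
    }
    return dest;
}
```

Checking the generated assembly (for example with gcc -O3 -S) is the only reliable way to confirm that no call to memcpy remains.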
This won't work (memcpy unconditionally calls itself), and it causes a segfault.
Redefining memcpy in a hosted program is undefined behavior, because the name is reserved for the standard library.
How do I disable the optimization that causes the copy to be transformed into a memcpy call (preferably while still compiling with -O3)?
Don't. The best approach is fixing your code instead:
In most cases, you should use another name.
In the rare case you are really implementing a C library (as discussed in the comments), and you really want to reimplement memcpy, then you should be using compiler-specific options to achieve that. For GCC, see -fno-builtin* and -ffreestanding, as well as -nodefaultlibs and -nostdlib.