How do I stop GCC from optimizing this byte-for-byte copy into a memcpy call?

As part of my implementation of the standard C library, I have this memcpy, which copies memory from src to dest one byte at a time:

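#include <stddef.h>  /* size_t */

void *memcpy(void *restrict dest, const void *restrict src, size_t len)
{
    char *dp = (char *restrict)dest;
    const char *sp = (const char *restrict)src;

    /* Copy one byte per iteration until len bytes have been copied. */
    while( len-- )
    {
        *dp++ = *sp++;
    }

    /* memcpy returns the destination pointer. */
    return dest;
}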

With gcc -O2, the code generated is reasonable:

memcpy:
.LFB0:
        movq    %rdi, %rax
        testq   %rdx, %rdx
        je      .L2
        xorl    %ecx, %ecx
.L3:
        movzbl  (%rsi,%rcx), %r8d
        movb    %r8b, (%rax,%rcx)
        addq    $1, %rcx
        cmpq    %rdx, %rcx
        jne     .L3
.L2:
        ret
.LFE0:

However, with gcc -O3, GCC turns this naive byte-for-byte copy into a memcpy call:

memcpy:
.LFB0:
        testq   %rdx, %rdx
        je      .L7
        subq    $8, %rsp
        call    memcpy
        addq    $8, %rsp
        ret
.L7:
        movq    %rdi, %rax
        ret
.LFE0:

This won't work (memcpy unconditionally calls itself), and it causes a segfault.

I've tried passing -fno-builtin-memcpy and -fno-loop-optimizations, and the same thing occurs.

I'm using GCC version 8.3.0:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-cros-linux-gnu/8.3.0/lto-wrapper
Target: x86_64-cros-linux-gnu
Configured with: ../configure --prefix=/usr/local --libdir=/usr/local/lib64 --build=x86_64-cros-linux-gnu --host=x86_64-cros-linux-gnu --target=x86_64-cros-linux-gnu --enable-checking=release --disable-multilib --enable-threads=posix --disable-bootstrap --disable-werror --disable-libmpx --enable-static --enable-shared --program-suffix=-8.3.0 --with-arch-64=x86-64
Thread model: posix
gcc version 8.3.0 (GCC)

How do I disable the optimization that causes the copy to be transformed into a memcpy call?

One thing that seems to be sufficient here: instead of -fno-builtin-memcpy, use -fno-builtin, and apply it only when compiling the translation unit that defines memcpy.

An alternative would be to pass -fno-tree-loop-distribute-patterns, though this might be brittle: it forbids the compiler from first reorganizing the loop code and then replacing parts of it with calls to mem* functions.
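The same option can also be applied to a single function rather than to a whole compilation, using GCC's optimize function attribute. The following is a minimal sketch, not a definitive fix: the macro name NO_LOOP_TO_LIBCALL and the function name copy_bytes are hypothetical (the function is renamed so the sketch can be built and run in an ordinary hosted program; inside a C library it would be the body of memcpy). Note that GCC's documentation describes the optimize attribute as intended for debugging rather than production code, so whether it holds up should be checked against the generated assembly for the compiler version in use.

```c
#include <stddef.h>

/* GCC only: disable loop-to-libcall pattern replacement for one function.
 * NO_LOOP_TO_LIBCALL is a hypothetical name; other compilers get an empty
 * macro and are not covered by this sketch. */
#if defined(__GNUC__) && !defined(__clang__)
#  define NO_LOOP_TO_LIBCALL \
     __attribute__((__optimize__("-fno-tree-loop-distribute-patterns")))
#else
#  define NO_LOOP_TO_LIBCALL
#endif

/* copy_bytes is a hypothetical stand-in for the library's memcpy. */
NO_LOOP_TO_LIBCALL
void *copy_bytes(void *restrict dest, const void *restrict src, size_t len)
{
    unsigned char *dp = dest;
    const unsigned char *sp = src;

    /* Plain byte-for-byte copy; the attribute asks GCC not to
     * replace this loop with a call to memcpy. */
    while (len--) {
        *dp++ = *sp++;
    }
    return dest;
}
```

Compared with the command-line flag, this keeps the restriction next to the one function that needs it, so the rest of the file is still optimized normally.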

Or, since you cannot rely on anything in the C library, perhaps -ffreestanding is in order.
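When a file is compiled with -ffreestanding, the standard macro __STDC_HOSTED__ is 0, so library source can detect which mode it was built in. The sketch below only shows that check; it makes no claim about whether -ffreestanding alone stops the loop transformation. The helper name libc_built_freestanding is hypothetical.

```c
/* Hypothetical helper: returns 1 when this translation unit was compiled
 * for a freestanding environment (for example with gcc -ffreestanding),
 * and 0 when it was compiled for a hosted environment. */
int libc_built_freestanding(void)
{
#if defined(__STDC_HOSTED__) && __STDC_HOSTED__ == 0
    return 1;
#else
    return 0;
#endif
}
```

A library build could turn the same preprocessor test into an #error to catch a file that was accidentally compiled without the intended flags.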

"This won't work (memcpy unconditionally calls itself), and it causes a segfault."

Redefining memcpy is undefined behavior: in a hosted implementation, the name is reserved for the standard library.

"How do I disable the optimization that causes the copy to be transformed into a memcpy call (preferably while still compiling with -O3)?"

Don't. The best approach is fixing your code instead:

- In most cases, you should use another name.
- In the rare case that you are really implementing a C library (as discussed in the comments) and really want to reimplement memcpy, you should use compiler-specific options to achieve that. For GCC, see -fno-builtin* and -ffreestanding, as well as -nodefaultlibs and -nostdlib.

Tags: c, compiler-optimization, gcc

Asked by: S.S. Anne

2 Answers: Antti Haapala -- Слава Україні, Acorn
