I wrote a simple test (code at the bottom) to benchmark the performance of memcpy on my 64-bit Debian system. When compiled as a 64-bit binary it gives a consistent 38-40 GB/s across all block sizes, but when built as a 32-bit binary on the same system the copy performance is abysmal.
I also wrote my own memcpy implementation in assembler that leverages SIMD, and it is able to match the 64-bit performance. I am honestly shocked that my own memcpy is so much faster than the native one; surely something must be wrong with the 32-bit libc build.
32-bit build:
0x00100000 B, 0.034215 ms, 29227.06 MB/s (16384 iterations)
0x00200000 B, 0.033453 ms, 29892.56 MB/s ( 8192 iterations)
0x00300000 B, 0.048710 ms, 20529.48 MB/s ( 5461 iterations)
0x00400000 B, 0.049187 ms, 20330.54 MB/s ( 4096 iterations)
0x00500000 B, 0.058945 ms, 16965.01 MB/s ( 3276 iterations)
0x00600000 B, 0.060735 ms, 16465.01 MB/s ( 2730 iterations)
0x00700000 B, 0.068973 ms, 14498.34 MB/s ( 2340 iterations)
0x00800000 B, 0.078325 ms, 12767.34 MB/s ( 2048 iterations)
0x00900000 B, 0.099801 ms, 10019.92 MB/s ( 1820 iterations)
0x00a00000 B, 0.111160 ms, 8996.04 MB/s ( 1638 iterations)
0x00b00000 B, 0.120044 ms, 8330.31 MB/s ( 1489 iterations)
0x00c00000 B, 0.116506 ms, 8583.26 MB/s ( 1365 iterations)
0x00d00000 B, 0.120322 ms, 8311.06 MB/s ( 1260 iterations)
0x00e00000 B, 0.114424 ms, 8739.40 MB/s ( 1170 iterations)
0x00f00000 B, 0.128843 ms, 7761.37 MB/s ( 1092 iterations)
0x01000000 B, 0.118122 ms, 8465.85 MB/s ( 1024 iterations)
0x08000000 B, 0.140218 ms, 7131.76 MB/s ( 128 iterations)
0x10000000 B, 0.115596 ms, 8650.85 MB/s ( 64 iterations)
0x20000000 B, 0.115325 ms, 8671.16 MB/s ( 32 iterations)

64-bit build:
0x00100000 B, 0.022237 ms, 44970.48 MB/s (16384 iterations)
0x00200000 B, 0.022293 ms, 44856.77 MB/s ( 8192 iterations)
0x00300000 B, 0.021729 ms, 46022.49 MB/s ( 5461 iterations)
0x00400000 B, 0.028348 ms, 35275.28 MB/s ( 4096 iterations)
0x00500000 B, 0.026118 ms, 38288.08 MB/s ( 3276 iterations)
0x00600000 B, 0.026161 ms, 38225.15 MB/s ( 2730 iterations)
0x00700000 B, 0.026199 ms, 38169.68 MB/s ( 2340 iterations)
0x00800000 B, 0.026236 ms, 38116.22 MB/s ( 2048 iterations)
0x00900000 B, 0.026090 ms, 38329.50 MB/s ( 1820 iterations)
0x00a00000 B, 0.026085 ms, 38336.39 MB/s ( 1638 iterations)
0x00b00000 B, 0.026079 ms, 38345.59 MB/s ( 1489 iterations)
0x00c00000 B, 0.026147 ms, 38245.75 MB/s ( 1365 iterations)
0x00d00000 B, 0.026033 ms, 38412.69 MB/s ( 1260 iterations)
0x00e00000 B, 0.026037 ms, 38407.40 MB/s ( 1170 iterations)
0x00f00000 B, 0.026019 ms, 38433.80 MB/s ( 1092 iterations)
0x01000000 B, 0.026041 ms, 38401.61 MB/s ( 1024 iterations)
0x08000000 B, 0.026123 ms, 38280.89 MB/s ( 128 iterations)
0x10000000 B, 0.026083 ms, 38338.70 MB/s ( 64 iterations)
0x20000000 B, 0.026116 ms, 38290.93 MB/s ( 32 iterations)

32-bit build using my own SIMD memcpy:
0x00100000 B, 0.026807 ms, 37303.21 MB/s (16384 iterations)
0x00200000 B, 0.026500 ms, 37735.59 MB/s ( 8192 iterations)
0x00300000 B, 0.026810 ms, 37300.04 MB/s ( 5461 iterations)
0x00400000 B, 0.026214 ms, 38148.05 MB/s ( 4096 iterations)
0x00500000 B, 0.026738 ms, 37399.74 MB/s ( 3276 iterations)
0x00600000 B, 0.026035 ms, 38409.15 MB/s ( 2730 iterations)
0x00700000 B, 0.026262 ms, 38077.29 MB/s ( 2340 iterations)
0x00800000 B, 0.026190 ms, 38183.00 MB/s ( 2048 iterations)
0x00900000 B, 0.026287 ms, 38042.18 MB/s ( 1820 iterations)
0x00a00000 B, 0.026263 ms, 38076.66 MB/s ( 1638 iterations)
0x00b00000 B, 0.026162 ms, 38223.48 MB/s ( 1489 iterations)
0x00c00000 B, 0.026189 ms, 38183.45 MB/s ( 1365 iterations)
0x00d00000 B, 0.026012 ms, 38444.52 MB/s ( 1260 iterations)
0x00e00000 B, 0.026089 ms, 38330.05 MB/s ( 1170 iterations)
0x00f00000 B, 0.026373 ms, 37917.10 MB/s ( 1092 iterations)
0x01000000 B, 0.026304 ms, 38016.85 MB/s ( 1024 iterations)
0x08000000 B, 0.025958 ms, 38523.59 MB/s ( 128 iterations)
0x10000000 B, 0.025992 ms, 38473.84 MB/s ( 64 iterations)
0x20000000 B, 0.026020 ms, 38431.96 MB/s ( 32 iterations)
(compile with: gcc -m32 -march=native -O3)
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
#include <malloc.h>

static inline uint64_t nanotime()
{
  struct timespec time;
  clock_gettime(CLOCK_MONOTONIC_RAW, &time);
  return ((uint64_t)time.tv_sec * 1e9) + time.tv_nsec;
}

void test(const int size)
{
  char * buffer1 = memalign(128, size);
  char * buffer2 = memalign(128, size);

  for(int i = 0; i < size; ++i)
    buffer2[i] = i;

  uint64_t t = nanotime();
  const uint64_t loops = (16384LL * 1048576LL) / size;
  for(uint64_t i = 0; i < loops; ++i)
    memcpy(buffer1, buffer2, size);

  double ms = (((float)(nanotime() - t) / loops) / 1000000.0f) / (size / 1024 / 1024);
  printf("0x%08x B, %8.6f ms, %8.2f MB/s (%5llu iterations)\n", size, ms, 1000.0 / ms, loops);

  // prevent the compiler from trying to optimize out the copy
  if (buffer1[0] == 0x0)
    return;

  free(buffer1);
  free(buffer2);
}

int main(int argc, char * argv[])
{
  for(int i = 0; i < 16; ++i)
    test((i+1) * 1024 * 1024);

  test(128 * 1024 * 1024);
  test(256 * 1024 * 1024);
  test(512 * 1024 * 1024);
  return 0;
}
perf output for the 32-bit build:

99.68% x32.n.bin x32.n.bin [.] test
0.28% x32.n.bin [kernel.kallsyms] [k] clear_page_rep
0.01% x32.n.bin [kernel.kallsyms] [k] get_page_from_freelist
0.01% x32.n.bin [kernel.kallsyms] [k] __mod_node_page_state
0.01% x32.n.bin [kernel.kallsyms] [k] page_fault
0.00% x32.n.bin [kernel.kallsyms] [k] default_send_IPI_single
0.00% perf_4.17 [kernel.kallsyms] [k] __x86_indirect_thunk_r14
My SIMD memcpy implementation:

inline static void memcpySSE(void *dst, const void * src, size_t length)
{
#if (defined(__x86_64__) || defined(__i386__))
if (length == 0 || dst == src)
return;
#ifdef __x86_64__
const void * end = dst + (length & ~0xFF);
size_t off = (15 - ((length & 0xFF) >> 4));
off = (off < 8) ? off * 16 : 7 * 16 + (off - 7) * 10;
#else
const void * end = dst + (length & ~0x7F);
const size_t off = (7 - ((length & 0x7F) >> 4)) * 10;
#endif
#ifdef __x86_64__
#define REG "rax"
#else
#define REG "eax"
#endif
__asm__ __volatile__ (
"cmp %[dst],%[end] \n\t"
"je Remain_%= \n\t"
// perform SIMD block copy
"loop_%=: \n\t"
"vmovaps 0x00(%[src]),%%xmm0 \n\t"
"vmovaps 0x10(%[src]),%%xmm1 \n\t"
"vmovaps 0x20(%[src]),%%xmm2 \n\t"
"vmovaps 0x30(%[src]),%%xmm3 \n\t"
"vmovaps 0x40(%[src]),%%xmm4 \n\t"
"vmovaps 0x50(%[src]),%%xmm5 \n\t"
"vmovaps 0x60(%[src]),%%xmm6 \n\t"
"vmovaps 0x70(%[src]),%%xmm7 \n\t"
#ifdef __x86_64__
"vmovaps 0x80(%[src]),%%xmm8 \n\t"
"vmovaps 0x90(%[src]),%%xmm9 \n\t"
"vmovaps 0xA0(%[src]),%%xmm10 \n\t"
"vmovaps 0xB0(%[src]),%%xmm11 \n\t"
"vmovaps 0xC0(%[src]),%%xmm12 \n\t"
"vmovaps 0xD0(%[src]),%%xmm13 \n\t"
"vmovaps 0xE0(%[src]),%%xmm14 \n\t"
"vmovaps 0xF0(%[src]),%%xmm15 \n\t"
#endif
"vmovntdq %%xmm0 ,0x00(%[dst]) \n\t"
"vmovntdq %%xmm1 ,0x10(%[dst]) \n\t"
"vmovntdq %%xmm2 ,0x20(%[dst]) \n\t"
"vmovntdq %%xmm3 ,0x30(%[dst]) \n\t"
"vmovntdq %%xmm4 ,0x40(%[dst]) \n\t"
"vmovntdq %%xmm5 ,0x50(%[dst]) \n\t"
"vmovntdq %%xmm6 ,0x60(%[dst]) \n\t"
"vmovntdq %%xmm7 ,0x70(%[dst]) \n\t"
#ifdef __x86_64__
"vmovntdq %%xmm8 ,0x80(%[dst]) \n\t"
"vmovntdq %%xmm9 ,0x90(%[dst]) \n\t"
"vmovntdq %%xmm10,0xA0(%[dst]) \n\t"
"vmovntdq %%xmm11,0xB0(%[dst]) \n\t"
"vmovntdq %%xmm12,0xC0(%[dst]) \n\t"
"vmovntdq %%xmm13,0xD0(%[dst]) \n\t"
"vmovntdq %%xmm14,0xE0(%[dst]) \n\t"
"vmovntdq %%xmm15,0xF0(%[dst]) \n\t"
"add $0x100,%[dst] \n\t"
"add $0x100,%[src] \n\t"
#else
"add $0x80,%[dst] \n\t"
"add $0x80,%[src] \n\t"
#endif
"cmp %[dst],%[end] \n\t"
"jne loop_%= \n\t"
"Remain_%=: \n\t"
// copy any remaining 16 byte blocks
#ifdef __x86_64__
"leaq (%%rip), %%rax\n\t"
#else
"call GetPC_%=\n\t"
#endif
"Offset_%=:\n\t"
"add $(BlockTable_%= - Offset_%=), %%" REG "\n\t"
"add %[off],%%" REG " \n\t"
"jmp *%%" REG " \n\t"
#ifdef __i386__
"GetPC_%=:\n\t"
"mov (%%esp), %%eax \n\t"
"ret \n\t"
#endif
"BlockTable_%=:\n\t"
#ifdef __x86_64__
"vmovaps 0xE0(%[src]),%%xmm14 \n\t"
"vmovntdq %%xmm14,0xE0(%[dst]) \n\t"
"vmovaps 0xD0(%[src]),%%xmm13 \n\t"
"vmovntdq %%xmm13,0xD0(%[dst]) \n\t"
"vmovaps 0xC0(%[src]),%%xmm12 \n\t"
"vmovntdq %%xmm12,0xC0(%[dst]) \n\t"
"vmovaps 0xB0(%[src]),%%xmm11 \n\t"
"vmovntdq %%xmm11,0xB0(%[dst]) \n\t"
"vmovaps 0xA0(%[src]),%%xmm10 \n\t"
"vmovntdq %%xmm10,0xA0(%[dst]) \n\t"
"vmovaps 0x90(%[src]),%%xmm9 \n\t"
"vmovntdq %%xmm9 ,0x90(%[dst]) \n\t"
"vmovaps 0x80(%[src]),%%xmm8 \n\t"
"vmovntdq %%xmm8 ,0x80(%[dst]) \n\t"
"vmovaps 0x70(%[src]),%%xmm7 \n\t"
"vmovntdq %%xmm7 ,0x70(%[dst]) \n\t"
#endif
"vmovaps 0x60(%[src]),%%xmm6 \n\t"
"vmovntdq %%xmm6 ,0x60(%[dst]) \n\t"
"vmovaps 0x50(%[src]),%%xmm5 \n\t"
"vmovntdq %%xmm5 ,0x50(%[dst]) \n\t"
"vmovaps 0x40(%[src]),%%xmm4 \n\t"
"vmovntdq %%xmm4 ,0x40(%[dst]) \n\t"
"vmovaps 0x30(%[src]),%%xmm3 \n\t"
"vmovntdq %%xmm3 ,0x30(%[dst]) \n\t"
"vmovaps 0x20(%[src]),%%xmm2 \n\t"
"vmovntdq %%xmm2 ,0x20(%[dst]) \n\t"
"vmovaps 0x10(%[src]),%%xmm1 \n\t"
"vmovntdq %%xmm1 ,0x10(%[dst]) \n\t"
"vmovaps 0x00(%[src]),%%xmm0 \n\t"
"vmovntdq %%xmm0 ,0x00(%[dst]) \n\t"
"nop\n\t"
"nop\n\t"
: [dst]"+r" (dst),
[src]"+r" (src)
: [off]"r" (off),
[end]"r" (end)
: REG,
"xmm0",
"xmm1",
"xmm2",
"xmm3",
"xmm4",
"xmm5",
"xmm6",
"xmm7",
#ifdef __x86_64__
"xmm8",
"xmm9",
"xmm10",
"xmm11",
"xmm12",
"xmm13",
"xmm14",
"xmm15",
#endif
"memory"
);
#undef REG
//copy any remaining bytes
for(size_t i = (length & 0xF); i; --i)
((uint8_t *)dst)[length - i] =
((uint8_t *)src)[length - i];
#else
memcpy(dst, src, length);
#endif
}
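
For reference, a usage sketch of the function above. The alignment note is my own assumption: vmovaps and vmovntdq fault on addresses that are not 16-byte aligned, which the memalign(128, ...) allocations used in the benchmark satisfy. The helper name is illustrative.

#include <malloc.h>
#include <string.h>

// Returns 0 if the copy round-trips correctly, -1 otherwise.
static int memcpySSE_selftest(size_t size)
{
    char *dst = memalign(128, size);   // 16-byte (or better) alignment required
    char *src = memalign(128, size);
    if (!dst || !src)
    {
        free(dst);
        free(src);
        return -1;
    }

    memset(src, 0xAB, size);
    memcpySSE(dst, src, size);         // SIMD block copy; the byte loop covers the tail

    int ok = (memcmp(dst, src, size) == 0);
    free(dst);
    free(src);
    return ok ? 0 : -1;
}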
GCC's inlined memcpy with -O3 -m32 -march=znver1:
cmp ebx, 4
jb .L56
mov ecx, DWORD PTR [ebp+0]
lea edi, [eax+4]
mov esi, ebp
and edi, -4
mov DWORD PTR [eax], ecx
mov ecx, DWORD PTR [ebp-4+ebx]
mov DWORD PTR [eax-4+ebx], ecx
mov ecx, eax
sub ecx, edi
sub esi, ecx
add ecx, ebx
shr ecx, 2
rep movsd
jmp .L14
memcpy is only faster if both buffers, src and dst, are 4-byte aligned. If so, memcpy() can copy a 32-bit word at a time inside its own loop over the length. If just one buffer is not word-aligned, it creates overhead to work out the alignment, and it falls back to a single-char copy loop for the tail.
In C, the memcpy function is used for this; in C++, the STL can be used as well (std::copy).
"memcpy is more efficient than memmove." In your case, you most probably are not doing the exact same thing while you run the two functions. In general, USE memmove only if you have to. USE it when there is a very reasonable chance that the source and destination regions are over-lapping.
memmove() is similar to memcpy() in that it also copies data from a source to a destination, but memcpy() leads to problems when the source and destination addresses overlap, because it simply copies data element by element from one location to the other. For example:
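A minimal sketch of the overlap case; the buffer contents and the shift amount are just illustrative.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[16] = "abcdefgh";

    // The source and destination regions overlap, so memmove() is required;
    // memcpy() would have undefined behaviour here and could clobber bytes
    // before they are read.
    memmove(buf + 2, buf, 8);
    buf[10] = '\0';

    printf("%s\n", buf);  // prints "ababcdefgh"
    return 0;
}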
Could it be that the Debian i386 libc is not compiled with SSE support? ... Confirmed: objdump shows no SSE being used in the inlined memcpy.
GCC treats memcpy as a built-in unless you use -fno-builtin-memcpy; as you saw from perf, no asm implementation in libc.so is even being called. (And gcc can't inline code out of a shared library; glibc headers only have a prototype, not an inline-asm implementation.)
Inlining memcpy as rep movs was purely GCC's idea, with gcc -O3 -m32 -march=znver1. (And the OP reports that -fno-builtin-memcpy sped up this microbenchmark, so apparently glibc's hand-written asm implementation is fine. That's expected; it's probably about the same as 64-bit, and doesn't benefit from more than 8 XMM or YMM registers.)
I would highly recommend against using -fno-builtin-memcpy in general, though, because you definitely want gcc to inline memcpy for stuff like float foo; int32_t bar; memcpy(&foo, &bar, sizeof(foo));, or other small fixed-size cases where it can inline as a single vector load/store. You definitely want gcc to understand that memcpy just copies memory, and not treat it as an opaque function.
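For instance, with the builtin enabled, the small fixed-size type pun mentioned above compiles down to a plain register move rather than a libc call (a minimal sketch of that case; the helper name is arbitrary):

#include <stdint.h>
#include <string.h>

// With memcpy treated as a builtin, gcc compiles this to a single 32-bit
// move, not a call into libc.
static inline float bits_to_float(int32_t bar)
{
    float foo;
    memcpy(&foo, &bar, sizeof(foo));  // well-defined way to type-pun in C
    return foo;
}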
The long-term solution is for gcc to not inline memcpy as rep movs on Zen; apparently that's not a good tuning decision when copies can be large. IDK if it's good for small copies; Intel has significant startup overhead.
The short-term solution is to manually call your custom memcpy (or somehow call the non-builtin glibc memcpy) for copies you know are usually large, but let gcc use its builtin for other cases. The super-ugly way would be to use -fno-builtin-memcpy and then use __builtin_memcpy instead of memcpy for small copies.
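A sketch of that workaround, assuming the file is compiled with -fno-builtin-memcpy; the wrapper name and the 4 KiB cutoff are illustrative and would need tuning.

#include <string.h>
#include <stddef.h>

// With -fno-builtin-memcpy, plain memcpy() below is a real call into glibc's
// asm implementation; __builtin_memcpy can still be expanded inline by gcc
// for small or fixed-size copies.
#define SMALL_COPY_LIMIT 4096  /* illustrative threshold, not tuned */

static inline void copy_bytes(void *dst, const void *src, size_t len)
{
    if (len <= SMALL_COPY_LIMIT)
        __builtin_memcpy(dst, src, len);  // gcc may inline this
    else
        memcpy(dst, src, len);            // real call to libc memcpy
}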
It looks like for large buffers, rep movs isn't great on Ryzen compared to NT stores. On Intel, I think rep movs is supposed to use a no-RFO protocol similar to NT stores, but maybe AMD is different.
"Enhanced REP MOVSB for memcpy" only mentions Intel, but it does have some details about bandwidth being limited by memory / L3 latency and max concurrency, rather than by actual DRAM controller bandwidth limits.
BTW, does your custom version even check a size threshold before choosing to use NT stores? NT stores suck for small to medium buffers if the data is going to be reloaded again right away; it will have to come from DRAM instead of being an L1d hit.
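For what it's worth, one way to add such a threshold check, using the memcpySSE from the question; the wrapper name and the 4 MiB cutoff are guesses and would need tuning against the cache sizes.

#include <string.h>
#include <stddef.h>

#define NT_THRESHOLD (4u * 1024u * 1024u)  /* illustrative cutoff, not tuned */

// Only take the non-temporal path for copies too large to stay in cache;
// smaller copies go through the normal (cache-allocating) memcpy so the
// data can still be an L1d/L2 hit when it's reloaded soon after.
static inline void smart_copy(void *dst, const void *src, size_t len)
{
    if (len >= NT_THRESHOLD)
        memcpySSE(dst, src, len);  // NT stores, bypasses the cache
    else
        memcpy(dst, src, len);     // regular stores, keeps data cached
}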