Why is memcpy slow in 32-bit mode with gcc -march=native on Ryzen, for large buffers?

I wrote a simple test (code at the bottom) to benchmark the performance of memcpy on my 64-bit Debian system. Compiled as a 64-bit binary it gives a consistent 38-40 GB/s across all block sizes, but built as a 32-bit binary on the same system the copy performance is abysmal.

I wrote my own memcpy implementation in assembler that leverages SIMD, and it is able to match the 64-bit performance. I am honestly shocked that my own memcpy is so much faster than the native one; surely something must be wrong with the 32-bit libc build.

32-bit memcpy test results

0x00100000 B, 0.034215 ms, 29227.06 MB/s (16384 iterations)
0x00200000 B, 0.033453 ms, 29892.56 MB/s ( 8192 iterations)
0x00300000 B, 0.048710 ms, 20529.48 MB/s ( 5461 iterations)
0x00400000 B, 0.049187 ms, 20330.54 MB/s ( 4096 iterations)
0x00500000 B, 0.058945 ms, 16965.01 MB/s ( 3276 iterations)
0x00600000 B, 0.060735 ms, 16465.01 MB/s ( 2730 iterations)
0x00700000 B, 0.068973 ms, 14498.34 MB/s ( 2340 iterations)
0x00800000 B, 0.078325 ms, 12767.34 MB/s ( 2048 iterations)
0x00900000 B, 0.099801 ms, 10019.92 MB/s ( 1820 iterations)
0x00a00000 B, 0.111160 ms,  8996.04 MB/s ( 1638 iterations)
0x00b00000 B, 0.120044 ms,  8330.31 MB/s ( 1489 iterations)
0x00c00000 B, 0.116506 ms,  8583.26 MB/s ( 1365 iterations)
0x00d00000 B, 0.120322 ms,  8311.06 MB/s ( 1260 iterations)
0x00e00000 B, 0.114424 ms,  8739.40 MB/s ( 1170 iterations)
0x00f00000 B, 0.128843 ms,  7761.37 MB/s ( 1092 iterations)
0x01000000 B, 0.118122 ms,  8465.85 MB/s ( 1024 iterations)
0x08000000 B, 0.140218 ms,  7131.76 MB/s (  128 iterations)
0x10000000 B, 0.115596 ms,  8650.85 MB/s (   64 iterations)
0x20000000 B, 0.115325 ms,  8671.16 MB/s (   32 iterations)

64-bit memcpy test results

0x00100000 B, 0.022237 ms, 44970.48 MB/s (16384 iterations)
0x00200000 B, 0.022293 ms, 44856.77 MB/s ( 8192 iterations)
0x00300000 B, 0.021729 ms, 46022.49 MB/s ( 5461 iterations)
0x00400000 B, 0.028348 ms, 35275.28 MB/s ( 4096 iterations)
0x00500000 B, 0.026118 ms, 38288.08 MB/s ( 3276 iterations)
0x00600000 B, 0.026161 ms, 38225.15 MB/s ( 2730 iterations)
0x00700000 B, 0.026199 ms, 38169.68 MB/s ( 2340 iterations)
0x00800000 B, 0.026236 ms, 38116.22 MB/s ( 2048 iterations)
0x00900000 B, 0.026090 ms, 38329.50 MB/s ( 1820 iterations)
0x00a00000 B, 0.026085 ms, 38336.39 MB/s ( 1638 iterations)
0x00b00000 B, 0.026079 ms, 38345.59 MB/s ( 1489 iterations)
0x00c00000 B, 0.026147 ms, 38245.75 MB/s ( 1365 iterations)
0x00d00000 B, 0.026033 ms, 38412.69 MB/s ( 1260 iterations)
0x00e00000 B, 0.026037 ms, 38407.40 MB/s ( 1170 iterations)
0x00f00000 B, 0.026019 ms, 38433.80 MB/s ( 1092 iterations)
0x01000000 B, 0.026041 ms, 38401.61 MB/s ( 1024 iterations)
0x08000000 B, 0.026123 ms, 38280.89 MB/s (  128 iterations)
0x10000000 B, 0.026083 ms, 38338.70 MB/s (   64 iterations)
0x20000000 B, 0.026116 ms, 38290.93 MB/s (   32 iterations)

custom 32-bit memcpy test results

0x00100000 B, 0.026807 ms, 37303.21 MB/s (16384 iterations)
0x00200000 B, 0.026500 ms, 37735.59 MB/s ( 8192 iterations)
0x00300000 B, 0.026810 ms, 37300.04 MB/s ( 5461 iterations)
0x00400000 B, 0.026214 ms, 38148.05 MB/s ( 4096 iterations)
0x00500000 B, 0.026738 ms, 37399.74 MB/s ( 3276 iterations)
0x00600000 B, 0.026035 ms, 38409.15 MB/s ( 2730 iterations)
0x00700000 B, 0.026262 ms, 38077.29 MB/s ( 2340 iterations)
0x00800000 B, 0.026190 ms, 38183.00 MB/s ( 2048 iterations)
0x00900000 B, 0.026287 ms, 38042.18 MB/s ( 1820 iterations)
0x00a00000 B, 0.026263 ms, 38076.66 MB/s ( 1638 iterations)
0x00b00000 B, 0.026162 ms, 38223.48 MB/s ( 1489 iterations)
0x00c00000 B, 0.026189 ms, 38183.45 MB/s ( 1365 iterations)
0x00d00000 B, 0.026012 ms, 38444.52 MB/s ( 1260 iterations)
0x00e00000 B, 0.026089 ms, 38330.05 MB/s ( 1170 iterations)
0x00f00000 B, 0.026373 ms, 37917.10 MB/s ( 1092 iterations)
0x01000000 B, 0.026304 ms, 38016.85 MB/s ( 1024 iterations)
0x08000000 B, 0.025958 ms, 38523.59 MB/s (  128 iterations)
0x10000000 B, 0.025992 ms, 38473.84 MB/s (   64 iterations)
0x20000000 B, 0.026020 ms, 38431.96 MB/s (   32 iterations)

Test Program

(compile with: gcc -m32 -march=native -O3)

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
#include <malloc.h>

static inline uint64_t nanotime()
{
  struct timespec time;
  clock_gettime(CLOCK_MONOTONIC_RAW, &time);
  return ((uint64_t)time.tv_sec * 1000000000ULL) + time.tv_nsec;
}

void test(const int size)
{
  char * buffer1 = memalign(128, size);
  char * buffer2 = memalign(128, size);

  for(int i = 0; i < size; ++i)
    buffer2[i] = i;

  uint64_t t           = nanotime();
  const uint64_t loops = (16384LL * 1048576LL) / size;
  for(uint64_t i = 0; i < loops; ++i)
    memcpy(buffer1, buffer2, size);
  // normalised to milliseconds per MiB copied, so 1000.0 / ms gives MiB/s
  double ms = (((float)(nanotime() - t) / loops) / 1000000.0f) / (size / 1024 / 1024);
  printf("0x%08x B, %8.6f ms, %8.2f MB/s (%5llu iterations)\n", size, ms, 1000.0 / ms, (unsigned long long)loops);

  // use the copied data so the compiler cannot optimize the copies away
  if (buffer1[0] != buffer2[0])
    puts("copy mismatch");

  free(buffer1);
  free(buffer2);
}

int main(int argc, char * argv[])
{
  for(int i = 0; i < 16; ++i)
    test((i+1) * 1024 * 1024);

  test(128 * 1024 * 1024);
  test(256 * 1024 * 1024);
  test(512 * 1024 * 1024);
  return 0;
}

Edit

  • Tested on a Ryzen 7 and a Threadripper 1950X
  • glibc: 2.27
  • gcc: 7.3.0

perf results:

  99.68%  x32.n.bin  x32.n.bin          [.] test
   0.28%  x32.n.bin  [kernel.kallsyms]  [k] clear_page_rep
   0.01%  x32.n.bin  [kernel.kallsyms]  [k] get_page_from_freelist
   0.01%  x32.n.bin  [kernel.kallsyms]  [k] __mod_node_page_state
   0.01%  x32.n.bin  [kernel.kallsyms]  [k] page_fault
   0.00%  x32.n.bin  [kernel.kallsyms]  [k] default_send_IPI_single
   0.00%  perf_4.17  [kernel.kallsyms]  [k] __x86_indirect_thunk_r14

custom SSE implementation

inline static void memcpySSE(void *dst, const void * src, size_t length)
{
#if (defined(__x86_64__) || defined(__i386__))
  if (length == 0 || dst == src)
    return;

#ifdef __x86_64__
  const void * end = dst + (length & ~0xFF);
  size_t off = (15 - ((length & 0xFF) >> 4));
  off = (off < 8) ? off * 16 : 7 * 16 + (off - 7) * 10;
#else
  const void * end = dst + (length & ~0x7F);
  const size_t off = (7 - ((length & 0x7F) >> 4)) * 10;
#endif

#ifdef __x86_64__
  #define REG "rax"
#else
  #define REG "eax"
#endif

  __asm__ __volatile__ (
   "cmp         %[dst],%[end] \n\t"
   "je          Remain_%= \n\t"

   // perform SIMD block copy
   "loop_%=: \n\t"
   "vmovaps     0x00(%[src]),%%xmm0  \n\t"
   "vmovaps     0x10(%[src]),%%xmm1  \n\t"
   "vmovaps     0x20(%[src]),%%xmm2  \n\t"
   "vmovaps     0x30(%[src]),%%xmm3  \n\t"
   "vmovaps     0x40(%[src]),%%xmm4  \n\t"
   "vmovaps     0x50(%[src]),%%xmm5  \n\t"
   "vmovaps     0x60(%[src]),%%xmm6  \n\t"
   "vmovaps     0x70(%[src]),%%xmm7  \n\t"
#ifdef __x86_64__
   "vmovaps     0x80(%[src]),%%xmm8  \n\t"
   "vmovaps     0x90(%[src]),%%xmm9  \n\t"
   "vmovaps     0xA0(%[src]),%%xmm10 \n\t"
   "vmovaps     0xB0(%[src]),%%xmm11 \n\t"
   "vmovaps     0xC0(%[src]),%%xmm12 \n\t"
   "vmovaps     0xD0(%[src]),%%xmm13 \n\t"
   "vmovaps     0xE0(%[src]),%%xmm14 \n\t"
   "vmovaps     0xF0(%[src]),%%xmm15 \n\t"
#endif
   "vmovntdq    %%xmm0 ,0x00(%[dst]) \n\t"
   "vmovntdq    %%xmm1 ,0x10(%[dst]) \n\t"
   "vmovntdq    %%xmm2 ,0x20(%[dst]) \n\t"
   "vmovntdq    %%xmm3 ,0x30(%[dst]) \n\t"
   "vmovntdq    %%xmm4 ,0x40(%[dst]) \n\t"
   "vmovntdq    %%xmm5 ,0x50(%[dst]) \n\t"
   "vmovntdq    %%xmm6 ,0x60(%[dst]) \n\t"
   "vmovntdq    %%xmm7 ,0x70(%[dst]) \n\t"
#ifdef __x86_64__
   "vmovntdq    %%xmm8 ,0x80(%[dst]) \n\t"
   "vmovntdq    %%xmm9 ,0x90(%[dst]) \n\t"
   "vmovntdq    %%xmm10,0xA0(%[dst]) \n\t"
   "vmovntdq    %%xmm11,0xB0(%[dst]) \n\t"
   "vmovntdq    %%xmm12,0xC0(%[dst]) \n\t"
   "vmovntdq    %%xmm13,0xD0(%[dst]) \n\t"
   "vmovntdq    %%xmm14,0xE0(%[dst]) \n\t"
   "vmovntdq    %%xmm15,0xF0(%[dst]) \n\t"

   "add         $0x100,%[dst] \n\t"
   "add         $0x100,%[src] \n\t"
#else
   "add         $0x80,%[dst] \n\t"
   "add         $0x80,%[src] \n\t"
#endif
   "cmp         %[dst],%[end] \n\t"
   "jne         loop_%= \n\t"

   "Remain_%=: \n\t"

   // copy any remaining 16 byte blocks
#ifdef __x86_64__
   "leaq        (%%rip), %%rax\n\t"
#else
   "call        GetPC_%=\n\t"
#endif
   "Offset_%=:\n\t"
   "add         $(BlockTable_%= - Offset_%=), %%" REG "\n\t"
   "add         %[off],%%" REG " \n\t"
   "jmp         *%%" REG " \n\t"

#ifdef __i386__
  "GetPC_%=:\n\t"
  "mov (%%esp), %%eax \n\t"
  "ret \n\t"
#endif

   "BlockTable_%=:\n\t"
#ifdef __x86_64__
   "vmovaps     0xE0(%[src]),%%xmm14 \n\t"
   "vmovntdq    %%xmm14,0xE0(%[dst]) \n\t"
   "vmovaps     0xD0(%[src]),%%xmm13 \n\t"
   "vmovntdq    %%xmm13,0xD0(%[dst]) \n\t"
   "vmovaps     0xC0(%[src]),%%xmm12 \n\t"
   "vmovntdq    %%xmm12,0xC0(%[dst]) \n\t"
   "vmovaps     0xB0(%[src]),%%xmm11 \n\t"
   "vmovntdq    %%xmm11,0xB0(%[dst]) \n\t"
   "vmovaps     0xA0(%[src]),%%xmm10 \n\t"
   "vmovntdq    %%xmm10,0xA0(%[dst]) \n\t"
   "vmovaps     0x90(%[src]),%%xmm9  \n\t"
   "vmovntdq    %%xmm9 ,0x90(%[dst]) \n\t"
   "vmovaps     0x80(%[src]),%%xmm8  \n\t"
   "vmovntdq    %%xmm8 ,0x80(%[dst]) \n\t"
   "vmovaps     0x70(%[src]),%%xmm7  \n\t"
   "vmovntdq    %%xmm7 ,0x70(%[dst]) \n\t"
#endif
   "vmovaps     0x60(%[src]),%%xmm6  \n\t"
   "vmovntdq    %%xmm6 ,0x60(%[dst]) \n\t"
   "vmovaps     0x50(%[src]),%%xmm5  \n\t"
   "vmovntdq    %%xmm5 ,0x50(%[dst]) \n\t"
   "vmovaps     0x40(%[src]),%%xmm4  \n\t"
   "vmovntdq    %%xmm4 ,0x40(%[dst]) \n\t"
   "vmovaps     0x30(%[src]),%%xmm3  \n\t"
   "vmovntdq    %%xmm3 ,0x30(%[dst]) \n\t"
   "vmovaps     0x20(%[src]),%%xmm2  \n\t"
   "vmovntdq    %%xmm2 ,0x20(%[dst]) \n\t"
   "vmovaps     0x10(%[src]),%%xmm1  \n\t"
   "vmovntdq    %%xmm1 ,0x10(%[dst]) \n\t"
   "vmovaps     0x00(%[src]),%%xmm0  \n\t"
   "vmovntdq    %%xmm0 ,0x00(%[dst]) \n\t"
   "nop\n\t"
   "nop\n\t"

   : [dst]"+r" (dst),
     [src]"+r" (src)
   : [off]"r"  (off),
     [end]"r"  (end)
   : REG,
     "xmm0",
     "xmm1",
     "xmm2",
     "xmm3",
     "xmm4",
     "xmm5",
     "xmm6",
     "xmm7",
#ifdef __x86_64__
     "xmm8",
     "xmm9",
     "xmm10",
     "xmm11",
     "xmm12",
     "xmm13",
     "xmm14",
     "xmm15",
#endif
     "memory"
  );

#undef REG

  //copy any remaining bytes
  for(size_t i = (length & 0xF); i; --i)
    ((uint8_t *)dst)[length - i] =
      ((uint8_t *)src)[length - i];
#else
  memcpy(dst, src, length);
#endif
}

memcpy as inlined by gcc with -O3 -m32 -march=znver1

  cmp ebx, 4
  jb .L56
  mov ecx, DWORD PTR [ebp+0]
  lea edi, [eax+4]
  mov esi, ebp
  and edi, -4
  mov DWORD PTR [eax], ecx
  mov ecx, DWORD PTR [ebp-4+ebx]
  mov DWORD PTR [eax-4+ebx], ecx
  mov ecx, eax
  sub ecx, edi
  sub esi, ecx
  add ecx, ebx
  shr ecx, 2
  rep movsd
  jmp .L14
asked May 19 '18 by Geoffrey

1 Answer

Could it be that the Debian i386 libc is not compiled with SSE support? ... Confirmed: objdump shows no SSE being used in the inlined memcpy.

GCC treats memcpy as a built-in unless you use -fno-builtin-memcpy; as you saw from perf, no asm implementation in libc.so is even being called. (And gcc can't inline code out of a shared library. glibc headers only have a prototype, not an inline-asm implementation.)

Inlining memcpy as rep movs was purely GCC's idea, with gcc -O3 -m32 -march=znver1. (And the OP reports that -fno-builtin-memcpy sped up this microbenchmark, so apparently glibc's hand-written asm implementation is fine. That's expected; it's probably about the same as 64-bit, and doesn't benefit from more than 8 XMM or YMM registers.)

I would highly recommend against using -fno-builtin-memcpy in general, though, because you definitely want gcc to inline memcpy for stuff like float foo; int32_t bar; memcpy(&foo, &bar, sizeof(foo));. Or other small fixed-size cases where it can inline as a single vector load/store. You definitely want gcc to understand the memcpy just copies memory, and not treat it as an opaque function.
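
For example, here's a minimal sketch of that kind of small fixed-size copy (the function name is just for illustration); with the builtin enabled, gcc typically lowers it to a single 32-bit load/store rather than a call:

#include <stdint.h>
#include <string.h>

// type-pun the bits of an int32_t into a float; with the builtin
// memcpy available, gcc normally compiles this to one 32-bit move
static inline float bits_to_float(int32_t bar)
{
  float foo;
  memcpy(&foo, &bar, sizeof foo);
  return foo;
}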

The long-term solution is for gcc to not inline memcpy as rep movs on Zen; apparently that's not a good tuning decision when copies can be large. IDK if it's good for small copies; Intel has significant startup overhead.

The short-term solution is to manually call your custom memcpy (or somehow call non-builtin glibc memcpy) for copies you know are usually large, but let gcc use its builtin for other cases. The super-ugly way would be to use -fno-builtin-memcpy and then use __builtin_memcpy instead of memcpy for small copies.
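
As a rough sketch of that ugly approach, assuming the whole translation unit is compiled with -fno-builtin-memcpy (the wrapper names here are made up for illustration):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// with -fno-builtin-memcpy, this is a genuine call into glibc's
// hand-written asm memcpy rather than gcc's inlined rep movs
static inline void copy_large(void *dst, const void *src, size_t n)
{
  memcpy(dst, src, n);
}

// small fixed-size copies can still be inlined via the builtin
static inline void store_float_bits(float *dst, const int32_t *src)
{
  __builtin_memcpy(dst, src, sizeof *dst);
}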


It looks like for large buffers, rep movs isn't great on Ryzen compared to NT stores. On Intel, I think rep movs is supposed to use a no-RFO protocol similar to NT stores, but maybe AMD is different.

Enhanced REP MOVSB for memcpy only mentions Intel, but it does have some details about bandwidth being limited by memory / L3 latency and max concurrency, rather than actual DRAM controller bandwidth limits.


BTW, does your custom version even check a size threshold before choosing to use NT stores? NT stores suck for small to medium buffers if the data is going to be reloaded again right away; it will have to come from DRAM instead of being an L1d hit.
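
Something like the following dispatch is what I mean; just a sketch that reuses your memcpySSE, and the 1 MiB cutoff is an assumption you'd want to tune by benchmarking on the target CPU:

#include <stddef.h>
#include <string.h>

// assumed threshold, not a measured one; tune for the target CPU
#define NT_COPY_THRESHOLD (1024 * 1024)

static inline void memcpy_dispatch(void *dst, const void *src, size_t n)
{
  if (n >= NT_COPY_THRESHOLD)
    memcpySSE(dst, src, n);  // huge copy: NT stores avoid polluting the cache
  else
    memcpy(dst, src, n);     // small/medium: keep the data hot in cache
}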

answered by Peter Cordes