
Very fast memcpy for image processing?


Courtesy of William Chan and Google: 30-70% faster than the memcpy in Microsoft Visual Studio 2005.

// Note: assumes dest and src are 16-byte aligned and size is a multiple of 128 bytes.
void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
{

  __asm
  {
    mov esi, src;    //src pointer
    mov edi, dest;   //dest pointer

    mov ebx, size;   //ebx is our counter 
    shr ebx, 7;      //divide by 128 (8 XMM registers * 16 bytes per iteration)


    loop_copy:
      prefetchnta 128[ESI]; //prefetch upcoming data (non-temporal hint)
      prefetchnta 160[ESI];
      prefetchnta 192[ESI];
      prefetchnta 224[ESI];

      movdqa xmm0, 0[ESI]; //move data from src to registers
      movdqa xmm1, 16[ESI];
      movdqa xmm2, 32[ESI];
      movdqa xmm3, 48[ESI];
      movdqa xmm4, 64[ESI];
      movdqa xmm5, 80[ESI];
      movdqa xmm6, 96[ESI];
      movdqa xmm7, 112[ESI];

      movntdq 0[EDI], xmm0; //move data from registers to dest
      movntdq 16[EDI], xmm1;
      movntdq 32[EDI], xmm2;
      movntdq 48[EDI], xmm3;
      movntdq 64[EDI], xmm4;
      movntdq 80[EDI], xmm5;
      movntdq 96[EDI], xmm6;
      movntdq 112[EDI], xmm7;

      add esi, 128;
      add edi, 128;
      dec ebx;

      jnz loop_copy; //repeat until the block counter reaches zero
  }
}

You may be able to optimize it further depending on your exact situation and any assumptions you are able to make.

You may also want to check out the memcpy source (memcpy.asm) and strip out its special-case handling. It may be possible to optimize further!
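
If you strip out the special-case handling, you become responsible for alignment and size yourself. As a rough sketch (fast_memcpy is just an illustrative name), a wrapper might handle the head and tail like this, assuming the routine above needs 16-byte-aligned pointers and a size that is a multiple of 128 bytes:

#include <string.h>

/* Sketch only: falls back to plain memcpy for small copies or when the two
   pointers cannot reach 16-byte alignment together. Assumes 32-bit pointers
   (as in the x86 code above). */
void fast_memcpy(void* dest, const void* src, unsigned long size)
{
    unsigned char* d = (unsigned char*)dest;
    const unsigned char* s = (const unsigned char*)src;
    unsigned long head = (16 - ((unsigned long)d & 15)) & 15;
    unsigned long bulk;

    if (size < 256 || head != ((16 - ((unsigned long)s & 15)) & 15))
    {
        memcpy(dest, src, size);                 /* small or incompatible alignment */
        return;
    }

    memcpy(d, s, head);                          /* copy the unaligned head */
    d += head; s += head; size -= head;

    bulk = size & ~(unsigned long)127;           /* largest multiple of 128 */
    X_aligned_memcpy_sse2(d, s, bulk);           /* fast aligned bulk copy */
    memcpy(d + bulk, s + bulk, size - bulk);     /* copy the remaining tail */
}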


The SSE code posted by hapalibashi is the way to go.

If you need even more performance and don't shy away from the long and winding road of writing a device driver: all major platforms nowadays have a DMA controller that can perform a copy job faster than, and in parallel with, anything the CPU could do.

That involves writing a driver, though. No major OS that I'm aware of exposes this functionality to user-side code, because of the security risks.

However, it may be worth it (if you need the performance) since no code on earth could outperform a piece of hardware that is designed to do such a job.


This question is four years old now and I'm a little surprised nobody has mentioned memory bandwidth yet. CPU-Z reports that my machine has PC3-10700 RAM. That means the RAM has a peak bandwidth (aka transfer rate, throughput, etc.) of 10700 MBytes/sec. The CPU in my machine is an i5-2430M, with a peak turbo frequency of 3 GHz.

Theoretically, with an infinitely fast CPU and my RAM, memcpy could go at 5300 MBytes/sec, i.e. half of 10700, because memcpy has to read from and then write to RAM. (Edit: as v.oddou pointed out, this is a simplistic approximation.)

On the other hand, imagine we had infinitely fast RAM and a realistic CPU, what could we achieve? Let's use my 3 GHz CPU as an example. If it could do a 32-bit read and a 32-bit write each cycle, then it could transfer 3e9 * 4 = 12000 MBytes/sec. This seems easily within reach for a modern CPU. Already, we can see that the code running on the CPU isn't really the bottleneck. This is one of the reasons that modern machines have data caches.

We can measure what the CPU can really do by benchmarking memcpy when we know the data is cached. Doing this accurately is fiddly. I made a simple app that wrote random numbers into an array, memcpy'd them to another array, then checksummed the copied data (a sketch of such a benchmark follows the results below). I stepped through the code in the debugger to make sure that the clever compiler had not removed the copy. Altering the size of the array alters the cache performance - small arrays fit in the cache, big ones less so. I got the following results:

  • 40 KByte arrays: 16000 MBytes/sec
  • 400 KByte arrays: 11000 MBytes/sec
  • 4000 KByte arrays: 3100 MBytes/sec
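
A minimal sketch of that kind of benchmark looks like this (sizes and the repeat count are illustrative, and clock() is a coarse timer - a real measurement would use something with higher resolution):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Sketch: fill a source array with random bytes, copy it repeatedly, and
   checksum the destination so the compiler cannot optimise the copy away. */
int main(void)
{
    const size_t size = 400 * 1024;   /* try 40 KByte, 400 KByte, 4000 KByte */
    const int reps = 1000;
    unsigned char* src = (unsigned char*)malloc(size);
    unsigned char* dst = (unsigned char*)malloc(size);
    unsigned checksum = 0;
    clock_t start, end;
    double secs;
    size_t i;
    int r;

    for (i = 0; i < size; i++)
        src[i] = (unsigned char)rand();       /* random source data */

    start = clock();
    for (r = 0; r < reps; r++)
        memcpy(dst, src, size);
    end = clock();

    for (i = 0; i < size; i++)
        checksum += dst[i];                   /* use the result so the copy stays */

    secs = (double)(end - start) / CLOCKS_PER_SEC;
    printf("checksum %u, %.0f MBytes/sec\n",
           checksum, (double)size * reps / (1024.0 * 1024.0) / secs);

    free(src);
    free(dst);
    return 0;
}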

Obviously, my CPU can read and write more than 32 bits per cycle, since 16000 is more than the 12000 I calculated theoretically above. This means the CPU is even less of a bottleneck than I already thought. I used Visual Studio 2005, and stepping into the standard memcpy implementation, I can see that it uses the movdqa instruction on my machine. I guess this can read and write 64 bits per cycle.

The nice code hapalibashi posted achieves 4200 MBytes/sec on my machine - about 40% faster than the VS 2005 implementation. I guess it is faster because it uses the prefetch instruction to improve cache performance.

In summary, the code running on the CPU isn't the bottleneck and tuning that code will only make small improvements.


At any optimisation level of -O1 or above, GCC will use builtin definitions for functions like memcpy - with the right -march parameter (-march=pentium4 for the set of features you mention) it should generate pretty optimal architecture-specific inline code.
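
For example, a fixed-size copy is the case GCC most reliably expands inline (copy_block is just an illustrative name); you can inspect what it emits with -S:

#include <string.h>

/* With a compile-time-constant size, GCC typically expands the memcpy
   inline rather than calling the library function. */
void copy_block(void* dest, const void* src)
{
    memcpy(dest, src, 128);
}

/* Inspect the generated code with:  gcc -O2 -march=pentium4 -S copy.c */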

I'd benchmark it and see what comes out.