Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimization in Memcpy implementation

Tags:

Below is the code which I got while searching for an optimized memcpy implementation.
Here's the link

void *memcpy(void *dst, void const *src, size_t len) {
    long *plDst = (long *)dst;
    long const *plSrc = (long const *)src;

    if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC)) {
        while (len >= 4) {
            *plDst++ = *plSrc++;
            len -= 4;
        }
    }

    char *pcDst = (char *)plDst;
    char const *pcDst = (char const *)plSrc;

    while (len--) {
         *pcDst++ = *pcSrc++;
    }

    return (dst);
}

Can somebody explain the line below to me?

if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC))

Here they want to check if src and dst address are aligned to a 4 byte boundary or not. Why are they using ! though, as it will make the condition false every time?

Secondly, is there any scope for more optimization in the above code?

like image 489
Abhi Avatar asked Jul 24 '18 09:07

Abhi


People also ask

How is memcpy implemented in C?

Implementation of memcpy is not a big deal, you need to typecast the given source and destination address to char* (1 byte). After the typecasting copy the data from the source to destination one by one until n (given length). Disclaimer: Below function only to understand the working of memcpy.

Is Memmove faster than memcpy?

"memcpy is more efficient than memmove." In your case, you most probably are not doing the exact same thing while you run the two functions. In general, USE memmove only if you have to. USE it when there is a very reasonable chance that the source and destination regions are over-lapping.

What is difference between memcpy and Memmove in C?

memmove() is similar to memcpy() as it also copies data from a source to destination. memcpy() leads to problems when source and destination addresses overlap as memcpy() simply copies data one by one from one location to another. For example consider below program.


1 Answers

This article, while addressing an interesting subject, falls short of providing correct examples. The code posted is referred to as GNU's newlib source code. Both the GNU project and the newlib team will be surprised to learn of this unexpected convergence statement! the newlib is not a GNU project and most of its source code is not GPL licenced.

This optimized implementation of memcpy is non portable, sub-optimal and in many aspects incorrect.

The test if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC)) attempts to detect if the src and dst addresses are aligned on long boundaries. It is cumbersome and non portable for multiple reasons and downright incorrect as you noticed:

  • implicit conversion from void * to int is ugly and implementation defined. The pointers should be cast as (uintptr_t) for better portability.
  • 0xFFFFFFFC assumes type long is 4 bytes. This may be incorrect, and indeed type long is 8-byte long on 64-bit linux and Mac systems.
  • src & 0xFFFFFFFC is not a check for alignment and is unlikely to be 0, the intended test for alignment on 4-byte boundaries is src & 3.

Furthermore, the code fails to optimize the case where src and dst have the same alignment but are not aligned on long boundaries.

Other possible improvements include unrolling the loops, using a switch for small values of len, combining bytes read from src into longs to write to dst once its is aligned on long boundaries...

Here is an improved alternative:

#include <stdint.h>

void *memcpy(void *dst, void const *src, size_t len) {
    unsigned char *pcDst = (unsigned char *)dst;
    unsigned char const *pcSrc = (unsigned char const *)src;

    if (len >= sizeof(long) * 2
    &&  ((uintptr_t)src & (sizeof(long) - 1)) == ((uintptr_t)dst & (sizeof(long) - 1))) {
        while (((uintptr_t)pcSrc & (sizeof(long) - 1)) != 0) {
            *pcDst++ = *pcSrc++;
            len--;
        }
        long *plDst = (long *)pcDst;
        long const *plSrc = (long const *)pcSrc;
        /* manually unroll the loop */
        while (len >= sizeof(long) * 4) {
            plDst[0] = plSrc[0];
            plDst[1] = plSrc[1];
            plDst[2] = plSrc[2];
            plDst[3] = plSrc[3];
            plSrc += 4;
            plDst += 4;
            len -= sizeof(long) * 4;
        }
        while (len >= sizeof(long)) {
            *plDst++ = *plSrc++;
            len -= sizeof(long);
        }
        pcDst = (unsigned char *)plDst;
        pcSrc = (unsigned char const *)plSrc;
    }
    while (len--) {
         *pcDst++ = *pcSrc++;
    }
    return dst;
}

Note that the casts from void * are unnecessary in C, but required in C++.

Here are some important points to keep in mind when trying to optimize code for speed:

  • no trade-offs on correctness. Fast code that fails even on border cases is useless.
  • beware of portability issues: solutions that depend on type sizes and/or byte order may create portability problems.
  • benchmarking will tell what works and what does not: you must carefully time the various alternatives on a variety of test cases.
  • there is no perfect solution: faster code on one architecture can be slower on a different one including and most problematically on a future one. Benchmark on different architectures.
  • fight complexity: avoid convoluted solutions that do not bring substantial improvements, their correctness is more difficult to prove and to maintain.
  • memcpy is usually optimized in assembly or implemented as a built-in by modern compilers.
like image 108
chqrlie Avatar answered Sep 28 '22 17:09

chqrlie