Below is the code which I got while searching for an optimized <code>memcpy</code> implementation. Here's the link <pre class="prettyprint"><code>void *memcpy(void *dst, void const *src, size_t len) { long *plDst = (long *)dst; long const *plSrc = (long const *)src; if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC)) { while (len >= 4) { *plDst++ = *plSrc++; len -= 4; } } char *pcDst = (char *)plDst; char const *pcDst = (char const *)plSrc; while (len--) { *pcDst++ = *pcSrc++; } return (dst); } </code></pre> Can somebody explain the line below to me? <pre class="prettyprint"><code>if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC)) </code></pre> Here they want to check if <code>src</code> and <code>dst</code> address are aligned to a <code>4 byte</code> boundary or not. Why are they using <code>!</code> though, as it will make the condition <code>false</code> every time? Secondly, is there any scope for more optimization in the above code?

This article, while addressing an interesting subject, falls short of providing correct examples. The code posted is referred to as GNU's newlib source code. Both the GNU project and the newlib team will be surprised to learn of this unexpected convergence statement! the newlib is not a GNU project and most of its source code is not GPL licenced. This optimized implementation of <code>memcpy</code> is non portable, sub-optimal and in many aspects incorrect. The test <code>if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC))</code> attempts to detect if the <code>src</code> and <code>dst</code> addresses are aligned on <code>long</code> boundaries. It is cumbersome and non portable for multiple reasons and downright incorrect as you noticed: <ul> <li>implicit conversion from <code>void *</code> to <code>int</code> is ugly and implementation defined. The pointers should be cast as <code>(uintptr_t)</code> for better portability.</li> <li> <code>0xFFFFFFFC</code> assumes type <code>long</code> is 4 bytes. This may be incorrect, and indeed type <code>long</code> is 8-byte long on 64-bit linux and Mac systems.</li> <li> <code>src & 0xFFFFFFFC</code> is not a check for alignment and is unlikely to be <code>0</code>, the intended test for alignment on 4-byte boundaries is <code>src & 3</code>.</li> </ul> Furthermore, the code fails to optimize the case where <code>src</code> and <code>dst</code> have the same alignment but are not aligned on <code>long</code> boundaries. Other possible improvements include unrolling the loops, using a switch for small values of <code>len</code>, combining bytes read from <code>src</code> into <code>long</code>s to write to <code>dst</code> once its is aligned on <code>long</code> boundaries... Here is an improved alternative: <pre class="prettyprint"><code>#include <stdint.h> void *memcpy(void *dst, void const *src, size_t len) { unsigned char *pcDst = (unsigned char *)dst; unsigned char const *pcSrc = (unsigned char const *)src; if (len >= sizeof(long) * 2 && ((uintptr_t)src & (sizeof(long) - 1)) == ((uintptr_t)dst & (sizeof(long) - 1))) { while (((uintptr_t)pcSrc & (sizeof(long) - 1)) != 0) { *pcDst++ = *pcSrc++; len--; } long *plDst = (long *)pcDst; long const *plSrc = (long const *)pcSrc; /* manually unroll the loop */ while (len >= sizeof(long) * 4) { plDst[0] = plSrc[0]; plDst[1] = plSrc[1]; plDst[2] = plSrc[2]; plDst[3] = plSrc[3]; plSrc += 4; plDst += 4; len -= sizeof(long) * 4; } while (len >= sizeof(long)) { *plDst++ = *plSrc++; len -= sizeof(long); } pcDst = (unsigned char *)plDst; pcSrc = (unsigned char const *)plSrc; } while (len--) { *pcDst++ = *pcSrc++; } return dst; } </code></pre> Note that the casts from <code>void *</code> are unnecessary in C, but required in C++. Here are some important points to keep in mind when trying to optimize code for speed: <ul> <li>no trade-offs on correctness. Fast code that fails even on border cases is useless.</li> <li>beware of portability issues: solutions that depend on type sizes and/or byte order may create portability problems.</li> <li>benchmarking will tell what works and what does not: you must carefully time the various alternatives on a variety of test cases.</li> <li>there is no perfect solution: faster code on one architecture can be slower on a different one including and most problematically on a future one. Benchmark on different architectures.</li> <li>fight complexity: avoid convoluted solutions that do not bring substantial improvements, their correctness is more difficult to prove and to maintain.</li> <li> <code>memcpy</code> is usually optimized in assembly or implemented as a built-in by modern compilers.</li> </ul>

Optimization in Memcpy implementation

Tags:

Below is the code which I got while searching for an optimized memcpy implementation.
Here's the link

void *memcpy(void *dst, void const *src, size_t len) {
    long *plDst = (long *)dst;
    long const *plSrc = (long const *)src;

    if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC)) {
        while (len >= 4) {
            *plDst++ = *plSrc++;
            len -= 4;
        }
    }

    char *pcDst = (char *)plDst;
    char const *pcDst = (char const *)plSrc;

    while (len--) {
         *pcDst++ = *pcSrc++;
    }

    return (dst);
}

Can somebody explain the line below to me?

if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC))

Here they want to check if src and dst address are aligned to a 4 byte boundary or not. Why are they using ! though, as it will make the condition false every time?

Secondly, is there any scope for more optimization in the above code?

489

asked Jul 24 '18 09:07

Abhi

1 Answers

This article, while addressing an interesting subject, falls short of providing correct examples. The code posted is referred to as GNU's newlib source code. Both the GNU project and the newlib team will be surprised to learn of this unexpected convergence statement! the newlib is not a GNU project and most of its source code is not GPL licenced.

This optimized implementation of memcpy is non portable, sub-optimal and in many aspects incorrect.

The test if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC)) attempts to detect if the src and dst addresses are aligned on long boundaries. It is cumbersome and non portable for multiple reasons and downright incorrect as you noticed:

implicit conversion from void * to int is ugly and implementation defined. The pointers should be cast as (uintptr_t) for better portability.
0xFFFFFFFC assumes type long is 4 bytes. This may be incorrect, and indeed type long is 8-byte long on 64-bit linux and Mac systems.
src & 0xFFFFFFFC is not a check for alignment and is unlikely to be 0, the intended test for alignment on 4-byte boundaries is src & 3.

Furthermore, the code fails to optimize the case where src and dst have the same alignment but are not aligned on long boundaries.

Other possible improvements include unrolling the loops, using a switch for small values of len, combining bytes read from src into longs to write to dst once its is aligned on long boundaries...

Here is an improved alternative:

#include <stdint.h>

void *memcpy(void *dst, void const *src, size_t len) {
    unsigned char *pcDst = (unsigned char *)dst;
    unsigned char const *pcSrc = (unsigned char const *)src;

    if (len >= sizeof(long) * 2
    &&  ((uintptr_t)src & (sizeof(long) - 1)) == ((uintptr_t)dst & (sizeof(long) - 1))) {
        while (((uintptr_t)pcSrc & (sizeof(long) - 1)) != 0) {
            *pcDst++ = *pcSrc++;
            len--;
        }
        long *plDst = (long *)pcDst;
        long const *plSrc = (long const *)pcSrc;
        /* manually unroll the loop */
        while (len >= sizeof(long) * 4) {
            plDst[0] = plSrc[0];
            plDst[1] = plSrc[1];
            plDst[2] = plSrc[2];
            plDst[3] = plSrc[3];
            plSrc += 4;
            plDst += 4;
            len -= sizeof(long) * 4;
        }
        while (len >= sizeof(long)) {
            *plDst++ = *plSrc++;
            len -= sizeof(long);
        }
        pcDst = (unsigned char *)plDst;
        pcSrc = (unsigned char const *)plSrc;
    }
    while (len--) {
         *pcDst++ = *pcSrc++;
    }
    return dst;
}

Note that the casts from void * are unnecessary in C, but required in C++.

Here are some important points to keep in mind when trying to optimize code for speed:

no trade-offs on correctness. Fast code that fails even on border cases is useless.
beware of portability issues: solutions that depend on type sizes and/or byte order may create portability problems.
benchmarking will tell what works and what does not: you must carefully time the various alternatives on a variety of test cases.
there is no perfect solution: faster code on one architecture can be slower on a different one including and most problematically on a future one. Benchmark on different architectures.
fight complexity: avoid convoluted solutions that do not bring substantial improvements, their correctness is more difficult to prove and to maintain.
memcpy is usually optimized in assembly or implemented as a built-in by modern compilers.

108

answered Sep 28 '22 17:09

chqrlie

Related questions
                            
                                how to print output of mutiple files side by side
                            
                                Spring Boot 2: Basic Http Auth causes unprotected endpoints to respond with 401 "Unauthorized" if Authorization header is attached
                            
                                DDPG (Deep Deterministic Policy Gradients), how is the actor updated?
                            
                                iText7 convert HTML to PDF "System.NullReferenceException."
                            
                                CSS Grid - Difference Between flex-start and Start Value in alignment/justifying Properties
                            
                                How to maximize CPU utilization in tensorflow GPU training with Dataset API?
                            
                                Edit Vader_lexicon.txt in nltk for python to add words related to my domain
                            
                                Bootstrap 4 grid | variable - fixed - variable | width column, overlaps on IE11
                            
                                Decoding OAuth2 JWT at API Gateway level vs at individual microservice level
                            
                                didSelectItemAt and didDeSelectItemAt from UICollectionView
                            
                                Regex Replace in the string with same value in multiple words
                            
                                Mongoose Querying Views

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With