I have a function that is doing memcpy, but it's taking up an enormous amount of cycles. Is there a faster alternative/approach than using memcpy to move a piece of memory?
memmove() is similar to memcpy() as it also copies data from a source to destination.
"memcpy is more efficient than memmove." In your case, you most probably are not doing the exact same thing while you run the two functions. In general, USE memmove only if you have to. USE it when there is a very reasonable chance that the source and destination regions are over-lapping.
If you know the length of a string, you can use mem functions instead of str functions. For example, memcpy is faster than strcpy because it does not have to search for the end of the string. If you are certain that the source and target do not overlap, use memcpy instead of memmove .
I also had some code that I really needed to speed up, and memcpy is slow because it has too many unnecessary checks. For example, it checks to see if the destination and source memory blocks overlap and if it should start copying from the back of the block rather than the front.
memcpy is likely to be the fastest way you can copy bytes around in memory. If you need something faster - try figuring out a way of not copying things around, e.g. swap pointers only, not the data itself.
This is an answer for x86_64 with AVX2 instruction set present. Though something similar may apply for ARM/AArch64 with SIMD.
On Ryzen 1800X with single memory channel filled completely (2 slots, 16 GB DDR4 in each), the following code is 1.56 times faster than memcpy() on MSVC++2017 compiler. If you fill both memory channels with 2 DDR4 modules, i.e. you have all 4 DDR4 slots busy, you may get further 2 times faster memory copying. For triple-(quad-)channel memory systems, you can get further 1.5(2.0) times faster memory copying if the code is extended to analogous AVX512 code. With AVX2-only triple/quad channel systems with all slots busy are not expected to be faster because to load them fully you need to load/store more than 32 bytes at once (48 bytes for triple- and 64-bytes for quad-channel systems), while AVX2 can load/store no more than 32 bytes at once. Though multithreading on some systems can alleviate this without AVX512 or even AVX2.
So here is the copy code that assumes you are copying a large block of memory whose size is a multiple of 32 and the block is 32-byte aligned.
For non-multiple size and non-aligned blocks, prologue/epilogue code can be written reducing the width to 16 (SSE4.1), 8, 4, 2 and finally 1 byte at once for the block head and tail. Also in the middle a local array of 2-3 __m256i values can be used as a proxy between aligned reads from the source and aligned writes to the destination.
#include <immintrin.h> #include <cstdint> /* ... */ void fastMemcpy(void *pvDest, void *pvSrc, size_t nBytes) {   assert(nBytes % 32 == 0);   assert((intptr_t(pvDest) & 31) == 0);   assert((intptr_t(pvSrc) & 31) == 0);   const __m256i *pSrc = reinterpret_cast<const __m256i*>(pvSrc);   __m256i *pDest = reinterpret_cast<__m256i*>(pvDest);   int64_t nVects = nBytes / sizeof(*pSrc);   for (; nVects > 0; nVects--, pSrc++, pDest++) {     const __m256i loaded = _mm256_stream_load_si256(pSrc);     _mm256_stream_si256(pDest, loaded);   }   _mm_sfence(); } A key feature of this code is that it skips CPU cache when copying: when CPU cache is involved (i.e. AVX instructions without _stream_ are used), the copy speed drops several times on my system.
My DDR4 memory is 2.6GHz CL13 . So when copying 8GB of data from one array to another I got the following speeds:
memcpy(): 17,208,004,271 bytes/sec. Stream copy: 26,842,874,528 bytes/sec. Note that in these measurements the total size of both input and output buffers is divided by the number of seconds elapsed. Because for each byte of the array there are 2 memory accesses: one to read the byte from the input array, another to write the byte to the output array. In the other words, when copying 8GB from one array to another, you do 16GB worth of memory access operations.
Moderate multithreading can further improve performance about 1.44 times, so total increase over memcpy() reaches 2.55 times on my machine. Here's how stream copy performance depends on the number of threads used on my machine:
Stream copy 1 threads: 27114820909.821 bytes/sec Stream copy 2 threads: 37093291383.193 bytes/sec Stream copy 3 threads: 39133652655.437 bytes/sec Stream copy 4 threads: 39087442742.603 bytes/sec Stream copy 5 threads: 39184708231.360 bytes/sec Stream copy 6 threads: 38294071248.022 bytes/sec Stream copy 7 threads: 38015877356.925 bytes/sec Stream copy 8 threads: 38049387471.070 bytes/sec Stream copy 9 threads: 38044753158.979 bytes/sec Stream copy 10 threads: 37261031309.915 bytes/sec Stream copy 11 threads: 35868511432.914 bytes/sec Stream copy 12 threads: 36124795895.452 bytes/sec Stream copy 13 threads: 36321153287.851 bytes/sec Stream copy 14 threads: 36211294266.431 bytes/sec Stream copy 15 threads: 35032645421.251 bytes/sec Stream copy 16 threads: 33590712593.876 bytes/sec The code is:
void AsyncStreamCopy(__m256i *pDest, const __m256i *pSrc, int64_t nVects) {   for (; nVects > 0; nVects--, pSrc++, pDest++) {     const __m256i loaded = _mm256_stream_load_si256(pSrc);     _mm256_stream_si256(pDest, loaded);   } }  void BenchmarkMultithreadStreamCopy(double *gpdOutput, const double *gpdInput, const int64_t cnDoubles) {   assert((cnDoubles * sizeof(double)) % sizeof(__m256i) == 0);   const uint32_t maxThreads = std::thread::hardware_concurrency();   std::vector<std::thread> thrs;   thrs.reserve(maxThreads + 1);    const __m256i *pSrc = reinterpret_cast<const __m256i*>(gpdInput);   __m256i *pDest = reinterpret_cast<__m256i*>(gpdOutput);   const int64_t nVects = cnDoubles * sizeof(*gpdInput) / sizeof(*pSrc);    for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {     auto start = std::chrono::high_resolution_clock::now();     lldiv_t perWorker = div((long long)nVects, (long long)nThreads);     int64_t nextStart = 0;     for (uint32_t i = 0; i < nThreads; i++) {       const int64_t curStart = nextStart;       nextStart += perWorker.quot;       if ((long long)i < perWorker.rem) {         nextStart++;       }       thrs.emplace_back(AsyncStreamCopy, pDest + curStart, pSrc+curStart, nextStart-curStart);     }     for (uint32_t i = 0; i < nThreads; i++) {       thrs[i].join();     }     _mm_sfence();     auto elapsed = std::chrono::high_resolution_clock::now() - start;     double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();     printf("Stream copy %d threads: %.3lf bytes/sec\n", (int)nThreads, cnDoubles * 2 * sizeof(double) / nSec);      thrs.clear();   } } If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With