Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Copying strided data in C++

Tags:

c++

stride

I have two arrays and I want to copy one array into the other with some stride. For example, I have

A A A A A A A A ...

B B B B B B B B ...

and I want to copy every three elements of B to A to obtain

B A A B A A B A ...

From the post "Is there a standard, strided version of memcpy?", it seems that there is no such a possibility in C.

However, I have experienced that, in some cases, memcpy is faster than a for loop based copy.

My question is; Is there any way to efficiently perform strided memory copy in C++ performing at least as a standard for loop?

Thank you very much.

EDIT - CLARIFICATION OF THE PROBLEM

To make the problem clearer, let us denote the two arrays at hand by a and b. I have a function that performs the unique following for loop

for (int i=0; i<NumElements, i++)
    a_[i] = b_[i];

where both the []'s are overloaded operators (I'm using an expression templates technique) so that they can be actually mean, for example

 a[3*i]=b[i];
like image 528
Vitality Avatar asked Jun 13 '13 15:06

Vitality


Video Answer


2 Answers

Might be a too specific answer, but on an ARM platform that supports NEON, NEON vectorization can be used to make strided copy even faster. This could be life-saving in an environment where resources are relatively more limited, which is probably why ARM is used in that setting in the first place. A prominent example is Android where most devices still use the ARM v7a architecture which supports NEON.

The following examples demonstrate this, it is a loop to copy the semi-planar UV plane of a YUV420sp image into the planar UV plane of a YUV420p image. The sizes of the source and destination buffers are both 640*480/2 bytes. All of the examples are compiled with the g++ 4.8 inside Android NDK r9d. They are executed on a Samsung Exynos Octa 5420 processor:

Level 1: Regular

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    for(int i=0;i<stride;i++){
        dstptr[i]           = srcptr[i*2];
        dstptr[i + stride]  = srcptr[i*2 + 1];
    }
}

Compiled with -O3 only, takes about 1.5 ms on average.

Level 2: Unrolled and squeezed a bit more with moving pointers

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    unsigned char* endptr = dstptr + stride;
    while(dstptr<endptr){
        *(dstptr + 0)             = *(srcptr + 0);
        *(dstptr + stride + 0)    = *(srcptr + 1);
        *(dstptr + 1)             = *(srcptr + 2);
        *(dstptr + stride + 1)    = *(srcptr + 3);
        *(dstptr + 2)             = *(srcptr + 4);
        *(dstptr + stride + 2)    = *(srcptr + 5);
        *(dstptr + 3)             = *(srcptr + 6);
        *(dstptr + stride + 3)    = *(srcptr + 7);
        *(dstptr + 4)             = *(srcptr + 8);
        *(dstptr + stride + 4)    = *(srcptr + 9);
        *(dstptr + 5)             = *(srcptr + 10);
        *(dstptr + stride + 5)    = *(srcptr + 11);
        *(dstptr + 6)             = *(srcptr + 12);
        *(dstptr + stride + 6)    = *(srcptr + 13);
        *(dstptr + 7)             = *(srcptr + 14);
        *(dstptr + stride + 7)    = *(srcptr + 15);
        srcptr+=16;
        dstptr+=8;
    } 
}

Compiled with -O3 only, takes about 1.15 ms on average. This is probably as fast as it gets on a regular architecture, as per the other answer.

Level 3: Regular + GCC automatic NEON vectorization

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    for(int i=0;i<stride;i++){
        dstptr[i]           = srcptr[i*2];
        dstptr[i + stride]  = srcptr[i*2 + 1];
    }
}

Compiled with -O3 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=1 -mfloat-abi=softfp, takes about 0.6 ms on average. For reference, a memcpy of 640*480 bytes, or double the amount of what's tested here, takes about 0.6 ms on average.

As a side note, the second code (unrolled and pointered) compiled with the NEON parameters above takes about the same amount of time, 0.6 ms.

like image 200
Ayberk Özgür Avatar answered Oct 23 '22 17:10

Ayberk Özgür


Is there any way to efficiently perform strided memory copy in C++ performing at least as a standard for loop?

Edit 2: There is no function for strided copying in the C++ libraries.

Since strided copying is not as popular a memory copying, chip manufacturers nor language designs have specialized support for strided copying.

Assuming a standard for loop, you may be able to gain some performance by using Loop Unrolling. Some compilers have options to unroll loops; it's not a "standard" option.

Given a standard for loop:

#define RESULT_SIZE 72
#define SIZE_A 48
#define SIZE_B 24

unsigned int A[SIZE_A];
unsigned int B[SIZE_B];
unsigned int result[RESULT_SIZE];

unsigned int index_a = 0;
unsigned int index_b = 0;
unsigned int index_result = 0;
for (index_result = 0; index_result < RESULT_SIZE;)
{
   result[index_result++] = B[index_b++];
   result[index_result++] = A[index_a++];
   result[index_result++] = A[index_a++]; 
}

Loop unrolling would repeat the contents of the "standard" for loop:

for (index_result = 0; index_result < RESULT_SIZE;)
{
   result[index_result++] = B[index_b++];
   result[index_result++] = A[index_a++];
   result[index_result++] = A[index_a++]; 

   result[index_result++] = B[index_b++];
   result[index_result++] = A[index_a++];
   result[index_result++] = A[index_a++]; 
}

In the unrolled version, the number of loops has been cut in half.

The performance improvement may be negligible compared to other options. The following issues affect performance and each may have different speed improvements:

  • Processing data cache misses
  • Reloading of instruction pipeline (depends on processor)
  • Operating System swapping memory with disk
  • Other tasks running concurrently
  • Parallel processing (depends on processor / platform)

One example of parallel processing is to have one processor copy the B items to the new array and another processor copy the A items to the new array.

like image 8
Thomas Matthews Avatar answered Oct 23 '22 16:10

Thomas Matthews