I have two arrays and I want to copy one array into the other with some stride. For example, I have
A A A A A A A A ...
B B B B B B B B ...
and I want to copy every three elements of B
to A
to obtain
B A A B A A B A ...
From the post "Is there a standard, strided version of memcpy?", it seems that there is no such a possibility in C.
However, I have experienced that, in some cases, memcpy
is faster than a for
loop based copy.
My question is; Is there any way to efficiently perform strided memory copy in C++ performing at least as a standard for
loop?
Thank you very much.
EDIT - CLARIFICATION OF THE PROBLEM
To make the problem clearer, let us denote the two arrays at hand by a
and b
. I have a function that performs the unique following for
loop
for (int i=0; i<NumElements, i++)
a_[i] = b_[i];
where both the []
's are overloaded operators (I'm using an expression templates technique) so that they can be actually mean, for example
a[3*i]=b[i];
Might be a too specific answer, but on an ARM platform that supports NEON, NEON vectorization can be used to make strided copy even faster. This could be life-saving in an environment where resources are relatively more limited, which is probably why ARM is used in that setting in the first place. A prominent example is Android where most devices still use the ARM v7a architecture which supports NEON.
The following examples demonstrate this, it is a loop to copy the semi-planar UV plane of a YUV420sp image into the planar UV plane of a YUV420p image. The sizes of the source and destination buffers are both 640*480/2
bytes. All of the examples are compiled with the g++ 4.8 inside Android NDK r9d. They are executed on a Samsung Exynos Octa 5420 processor:
Level 1: Regular
void convertUVsp2UVp(
unsigned char* __restrict srcptr,
unsigned char* __restrict dstptr,
int stride)
{
for(int i=0;i<stride;i++){
dstptr[i] = srcptr[i*2];
dstptr[i + stride] = srcptr[i*2 + 1];
}
}
Compiled with -O3
only, takes about 1.5 ms on average.
Level 2: Unrolled and squeezed a bit more with moving pointers
void convertUVsp2UVp(
unsigned char* __restrict srcptr,
unsigned char* __restrict dstptr,
int stride)
{
unsigned char* endptr = dstptr + stride;
while(dstptr<endptr){
*(dstptr + 0) = *(srcptr + 0);
*(dstptr + stride + 0) = *(srcptr + 1);
*(dstptr + 1) = *(srcptr + 2);
*(dstptr + stride + 1) = *(srcptr + 3);
*(dstptr + 2) = *(srcptr + 4);
*(dstptr + stride + 2) = *(srcptr + 5);
*(dstptr + 3) = *(srcptr + 6);
*(dstptr + stride + 3) = *(srcptr + 7);
*(dstptr + 4) = *(srcptr + 8);
*(dstptr + stride + 4) = *(srcptr + 9);
*(dstptr + 5) = *(srcptr + 10);
*(dstptr + stride + 5) = *(srcptr + 11);
*(dstptr + 6) = *(srcptr + 12);
*(dstptr + stride + 6) = *(srcptr + 13);
*(dstptr + 7) = *(srcptr + 14);
*(dstptr + stride + 7) = *(srcptr + 15);
srcptr+=16;
dstptr+=8;
}
}
Compiled with -O3
only, takes about 1.15 ms on average. This is probably as fast as it gets on a regular architecture, as per the other answer.
Level 3: Regular + GCC automatic NEON vectorization
void convertUVsp2UVp(
unsigned char* __restrict srcptr,
unsigned char* __restrict dstptr,
int stride)
{
for(int i=0;i<stride;i++){
dstptr[i] = srcptr[i*2];
dstptr[i + stride] = srcptr[i*2 + 1];
}
}
Compiled with -O3 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=1 -mfloat-abi=softfp
, takes about 0.6 ms on average. For reference, a memcpy
of 640*480
bytes, or double the amount of what's tested here, takes about 0.6 ms on average.
As a side note, the second code (unrolled and pointered) compiled with the NEON parameters above takes about the same amount of time, 0.6 ms.
Is there any way to efficiently perform strided memory copy in C++ performing at least as a standard for loop?
Edit 2: There is no function for strided copying in the C++ libraries.
Since strided copying is not as popular a memory copying, chip manufacturers nor language designs have specialized support for strided copying.
Assuming a standard for
loop, you may be able to gain some performance by using Loop Unrolling. Some compilers have options to unroll loops; it's not a "standard" option.
Given a standard for
loop:
#define RESULT_SIZE 72
#define SIZE_A 48
#define SIZE_B 24
unsigned int A[SIZE_A];
unsigned int B[SIZE_B];
unsigned int result[RESULT_SIZE];
unsigned int index_a = 0;
unsigned int index_b = 0;
unsigned int index_result = 0;
for (index_result = 0; index_result < RESULT_SIZE;)
{
result[index_result++] = B[index_b++];
result[index_result++] = A[index_a++];
result[index_result++] = A[index_a++];
}
Loop unrolling would repeat the contents of the "standard" for
loop:
for (index_result = 0; index_result < RESULT_SIZE;)
{
result[index_result++] = B[index_b++];
result[index_result++] = A[index_a++];
result[index_result++] = A[index_a++];
result[index_result++] = B[index_b++];
result[index_result++] = A[index_a++];
result[index_result++] = A[index_a++];
}
In the unrolled version, the number of loops has been cut in half.
The performance improvement may be negligible compared to other options. The following issues affect performance and each may have different speed improvements:
One example of parallel processing is to have one processor copy the B items to the new array and another processor copy the A items to the new array.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With