To XOR a huge uint32_t array I decided to use the NEON coprocessor. I implemented two C versions:
version 1:
#include <arm_neon.h>

/* XOR-reduce two lanes at a time (assumes size is a multiple of 2). */
uint32_t xor_array_ver_1(uint32_t *array, int size)
{
    uint32x2_t acc = vmov_n_u32(0);
    uint32_t acc1 = 0;
    for (; size != 0; size -= 2) {
        uint32x2_t vec = vld1_u32(array);
        array += 2;
        acc = veor_u32(acc, vec);
    }
    acc1 = vget_lane_u32(acc, 0) ^ vget_lane_u32(acc, 1);
    return acc1;
}
version 2:
/* XOR-reduce four lanes at a time with quad registers (assumes size is a multiple of 4). */
uint32_t xor_array_ver_2(uint32_t *array, int size)
{
    uint32x4_t acc = vmovq_n_u32(0);
    uint32_t acc1 = 0;
    for (; size != 0; size -= 4) {
        uint32x4_t vec = vld1q_u32(array);
        array += 4;
        acc = veorq_u32(acc, vec);
    }
    acc1 ^= vgetq_lane_u32(acc, 0);
    acc1 ^= vgetq_lane_u32(acc, 1);
    acc1 ^= vgetq_lane_u32(acc, 2);
    acc1 ^= vgetq_lane_u32(acc, 3);
    return acc1;
}
Comparing the above two versions to the traditional XOR implementation:
for (i = 0; i < arr_size; i++)
    val ^= my_array[i];
I observed 2 issues:
my_array is declared as uint32_t my_array[BIG_LENGTH];
Most likely this will be memory bandwidth limited - once you saturate the available DRAM bandwidth, which should be quite easy to do with only one ALU operation per load, you won't get any further benefit from optimisation.
Try to combine your XOR with another operation on the same data if possible - that way you amortise the cost of the cache misses.
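For instance, if the same buffer also had to be summed (a made-up second operation, purely for illustration), folding both reductions into one pass pulls the data through the cache only once; a sketch, assuming size is a multiple of 4:

/* Hypothetical example: XOR-reduce and sum-reduce the same buffer in a
 * single pass so it is only read from memory once. Assumes size % 4 == 0. */
#include <arm_neon.h>

void xor_and_sum(const uint32_t *array, int size,
                 uint32_t *xor_out, uint32_t *sum_out)
{
    uint32x4_t xacc = vmovq_n_u32(0);
    uint32x4_t sacc = vmovq_n_u32(0);
    for (; size != 0; size -= 4) {
        uint32x4_t vec = vld1q_u32(array);
        array += 4;
        xacc = veorq_u32(xacc, vec);   /* running XOR */
        sacc = vaddq_u32(sacc, vec);   /* running (wrapping) sum */
    }
    *xor_out = vgetq_lane_u32(xacc, 0) ^ vgetq_lane_u32(xacc, 1)
             ^ vgetq_lane_u32(xacc, 2) ^ vgetq_lane_u32(xacc, 3);
    *sum_out = vgetq_lane_u32(sacc, 0) + vgetq_lane_u32(sacc, 1)
             + vgetq_lane_u32(sacc, 2) + vgetq_lane_u32(sacc, 3);
}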
A lengthy answer without any code snippets.
First you should ask yourself: what do I expect? Do you want to write the fastest code possible? How can you verify that? Start by, for example, writing some tests to see what your hardware can achieve. As others have pointed out, this will mostly be memory-bandwidth limited, but then you need to know how fast your memory interface is. Figure out your platform's L1, L2 and RAM capacities and performance characteristics; then you'll know roughly what to expect at most for different buffer sizes.
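A minimal sketch of such a test, assuming a POSIX system with clock_gettime (the function name and structure are just illustrative):

#include <stdint.h>
#include <time.h>

/* Rough read-bandwidth probe: XOR-read 'size' words 'reps' times and
 * report bytes per second. Run it with buffers that fit in L1, in L2,
 * and only in DRAM to find your platform's knee points. */
double probe_read_bandwidth(const uint32_t *buf, int size, int reps)
{
    struct timespec t0, t1;
    uint32_t acc = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < size; i++)
            acc ^= buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile uint32_t sink = acc;   /* keep the loop from being optimised away */
    (void)sink;
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (double)size * sizeof(uint32_t) * reps / secs;
}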
Are you using the latest compiler? The next question is: are you using the tools available to you to their fullest? Most compilers do not aggressively optimize your code unless told to. Are you configuring them for the best gain? Are you enabling full optimization (gcc: -O3) and vectorization (gcc: -ftree-vectorize -ftree-vectorizer-verbose=1)? Are you setting the right configuration flags for your platform (-mcpu, -mfpu)?
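For illustration only (the toolchain prefix and source file name are placeholders, and the right -mcpu/-mfpu values depend on your target), a build line enabling all of the above might look like:

arm-linux-gnueabihf-gcc -O3 -ftree-vectorize -ftree-vectorizer-verbose=1 -mcpu=cortex-a9 -mfpu=neon -c xor_array.c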
Are you verifying the object code generated by the compiler? For such a simple loop this is very easy and helps you try many configuration options and check the code produced.
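For example (again, the toolchain prefix is a placeholder), disassembling the object file and checking whether the hot loop actually uses NEON loads and veor is quick to do:

arm-linux-gnueabihf-objdump -d xor_array.o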
Are you checking if using restricted pointers improves the performance?
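A minimal sketch of what that looks like on the scalar loop (whether it actually helps here is something to measure):

/* Sketch: const + restrict tells the compiler nothing else aliases the
 * buffer, which can make auto-vectorisation easier to prove legal. */
uint32_t xor_array_scalar(const uint32_t * restrict array, int size)
{
    uint32_t val = 0;
    for (int i = 0; i < size; i++)
        val ^= array[i];
    return val;
}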
What about alignment information? (For example, you don't mention it, but your intrinsics examples expect size to be a multiple of 2 or 4; and of course using quad registers can give around a 30% improvement.)
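One hedged way to lift the multiple-of-4 restriction (and avoid the endless loop it would otherwise cause) is a scalar tail loop, e.g.:

/* Sketch: quad-register XOR with a scalar tail, so 'size' no longer has
 * to be a multiple of 4. */
#include <arm_neon.h>

uint32_t xor_array_any_size(const uint32_t *array, int size)
{
    uint32x4_t acc = vmovq_n_u32(0);
    uint32_t acc1 = 0;
    for (; size >= 4; size -= 4) {
        acc = veorq_u32(acc, vld1q_u32(array));
        array += 4;
    }
    for (; size > 0; size--)           /* leftover 0..3 elements */
        acc1 ^= *array++;
    acc1 ^= vgetq_lane_u32(acc, 0);
    acc1 ^= vgetq_lane_u32(acc, 1);
    acc1 ^= vgetq_lane_u32(acc, 2);
    acc1 ^= vgetq_lane_u32(acc, 3);
    return acc1;
}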
What about also trying alignment to the cache line size?
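For example, the buffer could be allocated on a cache-line boundary with posix_memalign; this sketch assumes a 32-byte line, which is what the Cortex-A9 uses, but check your core's documentation:

#include <stdlib.h>
#include <stdint.h>

/* Sketch: allocate n_words uint32_t values aligned to an assumed
 * 32-byte cache line. */
#define CACHE_LINE 32

uint32_t *alloc_aligned_array(size_t n_words)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n_words * sizeof(uint32_t)) != 0)
        return NULL;
    return (uint32_t *)p;
}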
Do you know what your hardware is capable of? For example, the Cortex-A9 is introduced as an "out-of-order speculative issue superscalar" core. Can you take advantage of its dual-issue capability?
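One way to feed a dual-issue core is to keep two independent accumulators so that consecutive EORs do not depend on each other; a sketch, assuming size is a multiple of 8:

/* Sketch: two independent accumulators per iteration, so the two EORs
 * can issue in parallel on a dual-issue core. Assumes size % 8 == 0. */
#include <arm_neon.h>

uint32_t xor_array_unrolled(const uint32_t *array, int size)
{
    uint32x4_t acc0 = vmovq_n_u32(0);
    uint32x4_t acc1 = vmovq_n_u32(0);
    for (; size != 0; size -= 8) {
        acc0 = veorq_u32(acc0, vld1q_u32(array));
        acc1 = veorq_u32(acc1, vld1q_u32(array + 4));
        array += 8;
    }
    uint32x4_t acc = veorq_u32(acc0, acc1);
    return vgetq_lane_u32(acc, 0) ^ vgetq_lane_u32(acc, 1)
         ^ vgetq_lane_u32(acc, 2) ^ vgetq_lane_u32(acc, 3);
}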
So the answer is somewhere between "it depends" and "you need to experiment".