To XOR a huge uint32_t array I decided to use the NEON coprocessor. I implemented two C versions:
version 1:
#include <arm_neon.h>

/* XOR-reduce two lanes at a time (assumes size is a multiple of 2). */
uint32_t xor_array_ver_1(uint32_t *array, int size)
{
    uint32x2_t acc = vmov_n_u32(0);
    uint32_t acc1 = 0;
    for (; size != 0; size -= 2) {
        uint32x2_t vec = vld1_u32(array);
        array += 2;
        acc = veor_u32(acc, vec);
    }
    acc1 = vget_lane_u32(acc, 0) ^ vget_lane_u32(acc, 1);
    return acc1;
}
version 2:
/* XOR-reduce four lanes at a time with quad registers (assumes size is a multiple of 4). */
uint32_t xor_array_ver_2(uint32_t *array, int size)
{
    uint32x4_t acc = vmovq_n_u32(0);
    uint32_t acc1 = 0;
    for (; size != 0; size -= 4) {
        uint32x4_t vec = vld1q_u32(array);
        array += 4;
        acc = veorq_u32(acc, vec);
    }
    acc1 ^= vgetq_lane_u32(acc, 0);
    acc1 ^= vgetq_lane_u32(acc, 1);
    acc1 ^= vgetq_lane_u32(acc, 2);
    acc1 ^= vgetq_lane_u32(acc, 3);
    return acc1;
}
Comparing the above two versions to the traditional XOR implementation:
for (i = 0; i < arr_size; i++)
    val ^= my_array[i];
I observed 2 issues:
my_array is declared as uint32_t my_array[BIG_LENGTH];
Most likely this will be memory bandwidth limited - once you saturate the available DRAM bandwidth, which should be quite easy to do with only one ALU operation per load, you won't get any further benefit from optimisation.
Try to combine your XOR with another operation on the same data if possible - that way you amortise the cost of the cache misses.
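For instance, if the same buffer also had to be summed (a made-up second operation, purely for illustration), folding both reductions into one pass pulls the data through the cache only once; a sketch, assuming size is a multiple of 4:

/* Hypothetical example: XOR-reduce and sum-reduce the same buffer in a
 * single pass so it is only read from memory once. Assumes size % 4 == 0. */
#include <arm_neon.h>

void xor_and_sum(const uint32_t *array, int size,
                 uint32_t *xor_out, uint32_t *sum_out)
{
    uint32x4_t xacc = vmovq_n_u32(0);
    uint32x4_t sacc = vmovq_n_u32(0);
    for (; size != 0; size -= 4) {
        uint32x4_t vec = vld1q_u32(array);
        array += 4;
        xacc = veorq_u32(xacc, vec);   /* running XOR */
        sacc = vaddq_u32(sacc, vec);   /* running (wrapping) sum */
    }
    *xor_out = vgetq_lane_u32(xacc, 0) ^ vgetq_lane_u32(xacc, 1)
             ^ vgetq_lane_u32(xacc, 2) ^ vgetq_lane_u32(xacc, 3);
    *sum_out = vgetq_lane_u32(sacc, 0) + vgetq_lane_u32(sacc, 1)
             + vgetq_lane_u32(sacc, 2) + vgetq_lane_u32(sacc, 3);
}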
A lengthy answer without any code snippets.
First you should ask yourself: what do I expect? Do you want to write the fastest code possible? How can you verify that? Start by, for example, writing some tests to see what your hardware can achieve. As others have pointed out, this will mostly be memory-bandwidth limited, but then you need to know how fast your memory interface is. Figure out your platform's L1, L2 and RAM capacities and performance characteristics; then you'll know roughly what to expect at most for different buffer sizes.
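A minimal sketch of such a test, assuming a POSIX system with clock_gettime (the function name and structure are just illustrative):

#include <stdint.h>
#include <time.h>

/* Rough read-bandwidth probe: XOR-read 'size' words 'reps' times and
 * report bytes per second. Run it with buffers that fit in L1, in L2,
 * and only in DRAM to find your platform's knee points. */
double probe_read_bandwidth(const uint32_t *buf, int size, int reps)
{
    struct timespec t0, t1;
    uint32_t acc = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < size; i++)
            acc ^= buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile uint32_t sink = acc;   /* keep the loop from being optimised away */
    (void)sink;
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (double)size * sizeof(uint32_t) * reps / secs;
}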
Are you using the latest compiler? The next question is: are you using the tools available to you to their fullest? Most compilers do not aggressively optimize your code unless told to. Are you configuring them for the best gain? Are you enabling full optimization (gcc: -O3) and vectorization (gcc: -ftree-vectorize -ftree-vectorizer-verbose=1)? Are you setting the right configuration flags for your platform (-mcpu, -mfpu)?
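For illustration only (the toolchain prefix and source file name are placeholders, and the right -mcpu/-mfpu values depend on your target), a build line enabling all of the above might look like:

arm-linux-gnueabihf-gcc -O3 -ftree-vectorize -ftree-vectorizer-verbose=1 -mcpu=cortex-a9 -mfpu=neon -c xor_array.c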
Are you verifying the object code generated by the compiler? For such a simple loop this is very easy and helps you try many configuration options and check the code produced.
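For example (again, the toolchain prefix is a placeholder), disassembling the object file and checking whether the hot loop actually uses NEON loads and veor is quick to do:

arm-linux-gnueabihf-objdump -d xor_array.o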
Are you checking if using restricted pointers improves the performance?
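A minimal sketch of what that looks like on the scalar loop (whether it actually helps here is something to measure):

/* Sketch: const + restrict tells the compiler nothing else aliases the
 * buffer, which can make auto-vectorisation easier to prove legal. */
uint32_t xor_array_scalar(const uint32_t * restrict array, int size)
{
    uint32_t val = 0;
    for (int i = 0; i < size; i++)
        val ^= array[i];
    return val;
}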
What about alignment information? (For example, you don't mention it, but your intrinsics examples expect size to be a multiple of 2 or 4; and of course using quad registers can give around a 30% improvement.)
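One hedged way to lift the multiple-of-4 restriction (and avoid the endless loop it would otherwise cause) is a scalar tail loop, e.g.:

/* Sketch: quad-register XOR with a scalar tail, so 'size' no longer has
 * to be a multiple of 4. */
#include <arm_neon.h>

uint32_t xor_array_any_size(const uint32_t *array, int size)
{
    uint32x4_t acc = vmovq_n_u32(0);
    uint32_t acc1 = 0;
    for (; size >= 4; size -= 4) {
        acc = veorq_u32(acc, vld1q_u32(array));
        array += 4;
    }
    for (; size > 0; size--)           /* leftover 0..3 elements */
        acc1 ^= *array++;
    acc1 ^= vgetq_lane_u32(acc, 0);
    acc1 ^= vgetq_lane_u32(acc, 1);
    acc1 ^= vgetq_lane_u32(acc, 2);
    acc1 ^= vgetq_lane_u32(acc, 3);
    return acc1;
}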
What about also trying alignment to the cache line size?
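For example, the buffer could be allocated on a cache-line boundary with posix_memalign; this sketch assumes a 32-byte line, which is what the Cortex-A9 uses, but check your core's documentation:

#include <stdlib.h>
#include <stdint.h>

/* Sketch: allocate n_words uint32_t values aligned to an assumed
 * 32-byte cache line. */
#define CACHE_LINE 32

uint32_t *alloc_aligned_array(size_t n_words)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n_words * sizeof(uint32_t)) != 0)
        return NULL;
    return (uint32_t *)p;
}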
Do you know what your hardware is capable of? For example, the Cortex-A9 is introduced as an "out-of-order speculative issue superscalar" core. Can you take advantage of its dual-issue capability?
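One way to feed a dual-issue core is to keep two independent accumulators so that consecutive EORs do not depend on each other; a sketch, assuming size is a multiple of 8:

/* Sketch: two independent accumulators per iteration, so the two EORs
 * can issue in parallel on a dual-issue core. Assumes size % 8 == 0. */
#include <arm_neon.h>

uint32_t xor_array_unrolled(const uint32_t *array, int size)
{
    uint32x4_t acc0 = vmovq_n_u32(0);
    uint32x4_t acc1 = vmovq_n_u32(0);
    for (; size != 0; size -= 8) {
        acc0 = veorq_u32(acc0, vld1q_u32(array));
        acc1 = veorq_u32(acc1, vld1q_u32(array + 4));
        array += 8;
    }
    uint32x4_t acc = veorq_u32(acc0, acc1);
    return vgetq_lane_u32(acc, 0) ^ vgetq_lane_u32(acc, 1)
         ^ vgetq_lane_u32(acc, 2) ^ vgetq_lane_u32(acc, 3);
}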
So the answer is somewhere between "it depends" and "you need to experiment".