I have converted part of an algorithm from C to ARM assembler (using NEON instructions), but it is now 2x slower than the original C code. How can I improve performance?
The target is an ARM Cortex-A9.
The algorithm reads 64-bit values from an array. From each value one byte is extracted, which is then used as the lookup index into another table. This is done about 10 times, the resulting table values are XORed together, and the final result is written into another array.
Something like this:
result[i] = T0[ GetByte0( a[i1] ) ] ^ T1[ GetByte1( a[i2] ) ] ^ ... ^ T10[ (...) ];
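A plain-C sketch of the loop being described. The table count, the choice of which byte feeds which lookup, and all names are assumptions for illustration (the question only says "about 10 times" and uses distinct indices i1, i2, ...):

```c
#include <stdint.h>
#include <stddef.h>

#define NTABLES 10  /* "about 10" lookups per output; exact count is an assumption */

/* T[t] is a 256-entry table of 64-bit values; a and result are 64-bit arrays.
 * Here lookup t is assumed to use byte (t % 8) of a[i]; the real code may
 * use different input indices (the i1, i2, ... in the question). */
static void lookup_xor(uint64_t *result, size_t n,
                       const uint64_t *a,
                       uint64_t T[NTABLES][256])
{
    for (size_t i = 0; i < n; i++) {
        uint64_t acc = 0;
        for (int t = 0; t < NTABLES; t++) {
            uint8_t b = (uint8_t)(a[i] >> (8 * (t % 8)));  /* GetByteN */
            acc ^= T[t][b];
        }
        result[i] = acc;
    }
}
```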
In my approach I load the whole array "a" into NEON registers, then move the relevant byte into an ARM register, calculate the offset, and load the value from the table:
vldm.64 r0, {d0-d7} // load 8x 64-bit values from the input array
vmov.u8 r12, d0[0] // move the first byte of d0 into r12
add r12, r2, r12, asl #3 // r12 = base_address + (r12 << 3)
vldr.64 d8, [r12] // d8 = mem[r12]
.
.
.
veor d8, d8, d9 // d8 = d8 ^ d9
veor d8, d8, d10 // d8 = d8 ^ d10 ...etc.
where r2 holds the base address of the lookup table:
address = Table_address + (8 * value_fromByte);
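In C terms the same lookup looks like this (names hypothetical): indexing a table of uint64_t already is the base + (byte << 3) computation, so the add/shift above is exactly what a compiler would emit:

```c
#include <stdint.h>

/* One lookup, spelled out: extract byte k of v, then index a table of
 * 64-bit entries. &T[b] equals T_base + 8*b, i.e. the "asl #3" in the asm. */
static uint64_t lookup(const uint64_t T[256], uint64_t v, unsigned k)
{
    uint8_t b = (uint8_t)(v >> (8 * k));  /* GetByteK(v) */
    return T[b];                          /* load from T_base + (b << 3) */
}
```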
This step (except the loading at the beginning) is done about 100 times. Why is this so slow?
Also, what are the differences between "vld", "vldr" and "vldm", and which one is the fastest? How can I perform the offset calculation entirely within NEON registers? Thank you.
NEON isn't very capable of dealing with lookups larger than the VTBL instruction's limit (32 bytes, if I remember correctly).
How is the lookup table created in the first place? If it's just calculations, let NEON do the math instead of resorting to lookups. It will be much faster that way.
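To make that limit concrete, here is a plain-C model of VTBL's per-lane semantics with a full 4-register (32-byte) table; index bytes beyond the table size produce 0. This is a semantic sketch, not NEON code:

```c
#include <stdint.h>

/* Plain-C model of NEON VTBL with a 4-register (32-byte) table:
 * each of the 8 index bytes selects one table byte; out-of-range
 * indices yield 0. */
static void vtbl_model(uint8_t dst[8], const uint8_t table[32],
                       const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (idx[i] < 32) ? table[idx[i]] : 0;
}
```

A 256-entry table of 64-bit values, as in the question, is far beyond this 32-byte limit, which is why the lookups cannot simply stay inside NEON.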
Don't use
vmov.u8 r12, d0[0]
Moving data from a NEON register to an ARM register is the worst thing you can do: on the Cortex-A9 the NEON unit runs behind the ARM pipeline, so every NEON-to-ARM transfer stalls the core for many cycles.
Maybe you should look at the VTBL instruction! What is your byte range, 0..255?