Here is the C++ code:
#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

void cpp_tst_add( unsigned* x, unsigned* y )
{
    for ( register int i = 0; i < ARR_SIZE_TEST; ++i )
    {
        x[ i ] = x[ i ] + y[ i ];
    }
}
Here is the NEON version:
void neon_assm_tst_add( unsigned* x, unsigned* y )
{
    register unsigned i = ARR_SIZE_TEST >> 2;

    __asm__ __volatile__
    (
        ".loop1: \n\t"
        "vld1.32 {q0}, [%[x]] \n\t"
        "vld1.32 {q1}, [%[y]]! \n\t"
        "vadd.i32 q0 ,q0, q1 \n\t"
        "vst1.32 {q0}, [%[x]]! \n\t"
        "subs %[i], %[i], $1 \n\t"
        "bne .loop1 \n\t"
        : [x]"+r"(x), [y]"+r"(y), [i]"+r"(i)
        :
        : "memory"
    );
}
Test function:
void bench_simple_types_test( )
{
    unsigned* a = new unsigned [ ARR_SIZE_TEST ];
    unsigned* b = new unsigned [ ARR_SIZE_TEST ];

    neon_tst_add( a, b );
    neon_assm_tst_add( a, b );
}
I tested both variants, and here are the results:
add, unsigned, C++      : 176 ms
add, unsigned, neon asm : 185 ms // SLOW!!!
I also tested other types:
add, float, C++      : 571 ms
add, float, neon asm : 184 ms // FASTER X3!
THE QUESTION: Why is NEON slower with 32-bit integer types?
I used the latest version of GCC for the Android NDK. NEON optimization flags were turned on. Here is the disassembled C++ version:
        MOVS    R3, #0
        PUSH    {R4}
loc_8
        LDR     R4, [R0,R3]
        LDR     R2, [R1,R3]
        ADDS    R2, R4, R2
        STR     R2, [R0,R3]
        ADDS    R3, #4
        CMP.W   R3, #0x2000000
        BNE     loc_8
        POP     {R4}
        BX      LR
Here is the disassembled NEON version:
        MOV.W       R3, #0x200000
.loop1
        VLD1.32     {D0-D1}, [R0]
        VLD1.32     {D2-D3}, [R1]!
        VADD.I32    Q0, Q0, Q1
        VST1.32     {D0-D1}, [R0]!
        SUBS        R3, #1
        BNE         .loop1
        BX          LR
Here are all the benchmark results:
add, char, C++          : 83 ms
add, char, neon asm     : 46 ms   FASTER x2
add, short, C++         : 114 ms
add, short, neon asm    : 92 ms   FASTER x1.25
add, unsigned, C++      : 176 ms
add, unsigned, neon asm : 184 ms  SLOWER!!!
add, float, C++         : 571 ms
add, float, neon asm    : 184 ms  FASTER x3
add, double, C++        : 533 ms
add, double, neon asm   : 420 ms  FASTER x1.25
THE QUESTION: Why is NEON slower with 32-bit integer types?
The NEON pipeline on the Cortex-A8 executes in order and has limited hit-under-miss (no renaming), so you're limited by memory latency, because your arrays are much larger than the L1/L2 caches. Your code uses each value immediately after loading it from memory, so it stalls constantly waiting for memory. This would explain why the NEON code is slightly slower than the non-NEON code.
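To put "more than L1/L2 cache size" into numbers, the arithmetic can be sketched as below. `ARR_SIZE_TEST` comes from the question; the cache sizes are assumptions chosen only for illustration (the real values depend on the particular SoC), and all constant names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// The question's element type; 4 bytes on the targets discussed here.
static_assert(sizeof(unsigned) == 4, "sketch assumes a 32-bit unsigned");

// ARR_SIZE_TEST from the question: 8 * 1024 * 1024 elements per array.
constexpr std::size_t kElements      = 8u * 1024u * 1024u;
constexpr std::size_t kBytesPerArray = kElements * sizeof(unsigned); // 32 MiB
constexpr std::size_t kWorkingSet    = 2 * kBytesPerArray;           // x and y: 64 MiB

// ASSUMED cache sizes, for illustration only (not taken from the post).
constexpr std::size_t kAssumedL1Bytes = 32u * 1024u;
constexpr std::size_t kAssumedL2Bytes = 256u * 1024u;

// How many times larger the working set is than the assumed L2 cache.
constexpr std::size_t working_set_over_l2()
{
    return kWorkingSet / kAssumedL2Bytes;
}
```

Under these assumed sizes the two arrays together are a few hundred times larger than L2, so practically every cache line has to come from main memory.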
You need to unroll the assembly loops and increase the distance between load and use, e.g.:
vld1.32 {q0}, [%[x]]!
vld1.32 {q1}, [%[y]]!
vld1.32 {q2}, [%[x]]!
vld1.32 {q3}, [%[y]]!
vadd.i32 q0 ,q0, q1
vadd.i32 q2 ,q2, q3
...
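The same idea can be sketched in C++ with NEON intrinsics instead of inline assembly. This is only an illustration under a few assumptions: `add_u32_unrolled` is a hypothetical name, the length must be a multiple of 8, and the compiler is free to reschedule intrinsics, so the load-to-use distance written in the source is a hint rather than a guarantee. On non-NEON targets the function falls back to a plain scalar loop so the sketch still compiles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the original post): x[i] += y[i] for n
// elements, 8 elements per iteration. n must be a multiple of 8.
void add_u32_unrolled(std::uint32_t* x, const std::uint32_t* y, std::size_t n)
{
    assert(n % 8 == 0);
    for (std::size_t i = 0; i < n; i += 8) {
#if defined(__ARM_NEON)
        // Issue all four loads before the first add, so that no add
        // immediately follows the load it depends on.
        const uint32x4_t x0 = vld1q_u32(x + i);
        const uint32x4_t y0 = vld1q_u32(y + i);
        const uint32x4_t x1 = vld1q_u32(x + i + 4);
        const uint32x4_t y1 = vld1q_u32(y + i + 4);
        vst1q_u32(x + i,     vaddq_u32(x0, y0));
        vst1q_u32(x + i + 4, vaddq_u32(x1, y1));
#else
        // Portable fallback with the same result, for non-NEON builds.
        for (std::size_t k = 0; k < 8; ++k) {
            x[i + k] += y[i + k];
        }
#endif
    }
}
```

With the question's arrays this would be called as `add_u32_unrolled(a, b, ARR_SIZE_TEST)`, given that `unsigned` and `std::uint32_t` are the same type on the target. Unrolling further (16 or 32 elements per iteration) follows the same pattern.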
There are plenty of NEON registers, so you can unroll it a lot. Integer code suffers from the same issue, but to a lesser extent, because the A8 integer pipeline has better hit-under-miss and stalls less. With benchmarks this large compared to the L1/L2 caches, the bottleneck is going to be memory bandwidth/latency. You might also want to run the benchmark at smaller sizes (4KB..256KB) to see the effects when the data is cached entirely in L1 and/or L2.
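A size sweep of that kind might look roughly like the sketch below. It times the plain scalar loop only; `add_n`, `ns_per_element`, and `sweep` are hypothetical names, and the use of `std::chrono::steady_clock` and a repeat count are my own choices rather than anything from the original benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Same scalar loop as cpp_tst_add, but with a runtime length.
void add_n(unsigned* x, const unsigned* y, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        x[i] += y[i];
    }
}

// Hypothetical helper: average nanoseconds per element when each of the
// two arrays occupies `bytes` bytes and the loop is run `repeats` times.
double ns_per_element(std::size_t bytes, int repeats)
{
    const std::size_t n = bytes / sizeof(unsigned);
    assert(n > 0 && repeats > 0);
    // Initializing the vectors also touches every page before timing.
    std::vector<unsigned> x(n, 1u), y(n, 2u);

    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        add_n(x.data(), y.data(), n);
    }
    const auto t1 = std::chrono::steady_clock::now();

    // Read a result through a volatile so the work stays observable.
    volatile unsigned sink = x[n - 1];
    (void)sink;

    const std::chrono::duration<double, std::nano> dt = t1 - t0;
    return dt.count() / (static_cast<double>(n) * repeats);
}

// Doubles the per-array size from min_bytes up to and including max_bytes.
std::vector<double> sweep(std::size_t min_bytes, std::size_t max_bytes, int repeats)
{
    std::vector<double> out;
    for (std::size_t b = min_bytes; b <= max_bytes; b *= 2) {
        out.push_back(ns_per_element(b, repeats));
    }
    return out;
}
```

Calling `sweep(4 * 1024, 256 * 1024, 1000)` and printing the entries gives one figure per size; a jump in nanoseconds per element between two neighbouring sizes suggests the working set has outgrown a cache level. The NEON variants can be timed the same way by swapping out `add_n`.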