Why is ARM NEON not faster than plain C++?

Question

Here is the C++ code:

    #define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

    void cpp_tst_add( unsigned* x, unsigned* y )
    {
        for ( register int i = 0; i < ARR_SIZE_TEST; ++i )
        {
            x[ i ] = x[ i ] + y[ i ];
        }
    }

Here is the NEON version:

    void neon_assm_tst_add( unsigned* x, unsigned* y )
    {
        register unsigned i = ARR_SIZE_TEST >> 2;

        __asm__ __volatile__
        (
            ".loop1:                            \n\t"
            "vld1.32   {q0}, [%[x]]             \n\t"
            "vld1.32   {q1}, [%[y]]!            \n\t"
            "vadd.i32  q0 ,q0, q1               \n\t"
            "vst1.32   {q0}, [%[x]]!            \n\t"
            "subs     %[i], %[i], $1            \n\t"
            "bne      .loop1                    \n\t"
            : [x]"+r"(x), [y]"+r"(y), [i]"+r"(i)
            :
            : "memory"
        );
    }

Test function:

    void bench_simple_types_test( )
    {
        unsigned* a = new unsigned [ ARR_SIZE_TEST ];
        unsigned* b = new unsigned [ ARR_SIZE_TEST ];

        neon_tst_add( a, b );
        neon_assm_tst_add( a, b );
    }

I have tested both variants, and here are the results:

    add, unsigned, C++       : 176 ms
    add, unsigned, neon asm  : 185 ms // SLOW!!!

I also tested other types:

    add, float,    C++       : 571 ms
    add, float,    neon asm  : 184 ms // FASTER X3!

THE QUESTION: Why is NEON slower with 32-bit integer types?

I used the latest version of GCC from the Android NDK. NEON optimization flags were turned on. Here is the disassembled C++ version:

                     MOVS            R3, #0
                     PUSH            {R4}
    loc_8
                     LDR             R4, [R0,R3]
                     LDR             R2, [R1,R3]
                     ADDS            R2, R4, R2
                     STR             R2, [R0,R3]
                     ADDS            R3, #4
                     CMP.W           R3, #0x2000000
                     BNE             loc_8
                     POP             {R4}
                     BX              LR

Here is the disassembled NEON version:

                     MOV.W           R3, #0x200000
    .loop1
                     VLD1.32         {D0-D1}, [R0]
                     VLD1.32         {D2-D3}, [R1]!
                     VADD.I32        Q0, Q0, Q1
                     VST1.32         {D0-D1}, [R0]!
                     SUBS            R3, #1
                     BNE             .loop1
                     BX              LR

Here are all the benchmark results:

    add, char,     C++       : 83  ms
    add, char,     neon asm  : 46  ms FASTER x2

    add, short,    C++       : 114 ms
    add, short,    neon asm  : 92  ms FASTER x1.25

    add, unsigned, C++       : 176 ms
    add, unsigned, neon asm  : 184 ms SLOWER!!!

    add, float,    C++       : 571 ms
    add, float,    neon asm  : 184 ms FASTER x3

    add, double,   C++       : 533 ms
    add, double,   neon asm  : 420 ms FASTER x1.25

THE QUESTION: Why is NEON slower with 32-bit integer types?

John Ripley · Accepted Answer

The NEON pipeline on Cortex-A8 executes in order and has limited hit-under-miss (no renaming), so you're limited by memory latency, since your working set is far larger than the L1/L2 caches. Your code uses each value immediately after loading it from memory, so it stalls constantly waiting for memory. This would explain why the NEON code is slightly (by a tiny amount) slower than the non-NEON code.
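As a rough sanity check of the memory-bound explanation, the timings from the question can be turned into an effective throughput figure. The sketch below is only arithmetic on the numbers posted above. It assumes every test ran over the same `ARR_SIZE_TEST` elements, that each element is read from `x`, read from `y`, and written back to `x`, and that the element sizes are the usual ones for a 32-bit ARM target. The names `traffic_mib`, `mib_per_s`, `Row`, and `print_neon_throughput` are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdio>

// ARR_SIZE_TEST from the question: elements per array.
constexpr std::size_t kElems = 8u * 1024u * 1024u;

// MiB moved per pass, assuming each element is read from x, read from y,
// and written back to x (three accesses of elem_size bytes each).
constexpr double traffic_mib(std::size_t elem_size) {
    return 3.0 * static_cast<double>(kElems * elem_size) / (1024.0 * 1024.0);
}

// Effective throughput in MiB/s for a pass that took `ms` milliseconds.
constexpr double mib_per_s(std::size_t elem_size, double ms) {
    return traffic_mib(elem_size) / (ms / 1000.0);
}

struct Row {
    const char* type;
    std::size_t elem_size;  // assumed sizeof on a 32-bit ARM target
    double neon_ms;         // NEON timing reported in the question
};

// Prints one line per element type from the question's results table.
void print_neon_throughput() {
    const Row rows[] = {
        {"char", 1, 46.0},   {"short", 2, 92.0},   {"unsigned", 4, 184.0},
        {"float", 4, 184.0}, {"double", 8, 420.0},
    };
    for (const Row& r : rows) {
        std::printf("%-8s %6.1f MiB in %5.0f ms = %6.1f MiB/s\n",
                    r.type, traffic_mib(r.elem_size), r.neon_ms,
                    mib_per_s(r.elem_size, r.neon_ms));
    }
}
```

Calling `print_neon_throughput()` from `main` prints the table. Under these assumptions, char, short, unsigned, and float all come out at roughly 522 MiB/s and double at roughly 457 MiB/s. A NEON time that scales with the number of bytes moved rather than with the element type is what you would expect from a loop that is waiting on memory, not on arithmetic.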

You need to unroll the assembly loops and increase the distance between load and use, e.g.:

    vld1.32   {q0}, [%[x]]!
    vld1.32   {q1}, [%[y]]!
    vld1.32   {q2}, [%[x]]!
    vld1.32   {q3}, [%[y]]!
    vadd.i32  q0 ,q0, q1
    vadd.i32  q2 ,q2, q3
    ...
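The same idea can be sketched with NEON intrinsics instead of inline assembly. This is a minimal illustration, not a tuned implementation: `add_u32_unrolled` is a hypothetical name, the length is assumed to be a multiple of 8, and a scalar fallback is included only so the snippet also compiles on targets without NEON. Note that the compiler is free to reschedule intrinsics, so the load/use distance you write is not guaranteed to survive in the generated code; check the disassembly, or stay with hand-written assembly if you need exact control.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the original post): x[i] += y[i] for n elements.
// n must be a multiple of 8, i.e. two 128-bit vectors per iteration.
void add_u32_unrolled(std::uint32_t* x, const std::uint32_t* y, std::size_t n) {
    assert(n % 8 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (std::size_t i = 0; i < n; i += 8) {
        // Issue all four loads first, so no add has to wait on the
        // load immediately in front of it.
        const uint32x4_t x0 = vld1q_u32(x + i);
        const uint32x4_t y0 = vld1q_u32(y + i);
        const uint32x4_t x1 = vld1q_u32(x + i + 4);
        const uint32x4_t y1 = vld1q_u32(y + i + 4);
        // Then do the adds and stores.
        vst1q_u32(x + i, vaddq_u32(x0, y0));
        vst1q_u32(x + i + 4, vaddq_u32(x1, y1));
    }
#else
    // Scalar fallback so the sketch builds on non-NEON targets.
    for (std::size_t i = 0; i < n; ++i) {
        x[i] += y[i];
    }
#endif
}
```

Unrolling further (four or eight vectors per iteration) follows the same pattern: group the loads, then the adds, then the stores.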

There are plenty of NEON registers, so you can unroll it a lot. Integer code suffers from the same issue, but to a lesser extent, because the A8 integer pipeline has better hit-under-miss behavior instead of stalling. For benchmarks this large compared to the L1/L2 caches, the bottleneck is going to be memory bandwidth/latency. You might also want to run the benchmark at smaller sizes (4KB..256KB) to see the effects when the data is cached entirely in L1 and/or L2.
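A size sweep along those lines could look roughly like the sketch below. It uses the plain C++ add so that it is portable; swap in the NEON routine to compare. The helper names `add_u32`, `ns_per_element`, and `sweep` are invented for this example, and the 64 MiB work budget per size is an arbitrary choice that just keeps the run time similar across sizes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Plain C++ add from the question, with the length passed in.
void add_u32(unsigned* x, const unsigned* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = x[i] + y[i];
    }
}

// Average nanoseconds per element for two arrays of `bytes` bytes each,
// repeated `passes` times. The total working set is 2 * bytes.
double ns_per_element(std::size_t bytes, int passes) {
    const std::size_t n = bytes / sizeof(unsigned);
    assert(n > 0 && passes > 0);
    std::vector<unsigned> x(n, 1u), y(n, 2u);

    add_u32(x.data(), y.data(), n);  // warm-up pass to pull the data into cache

    const auto t0 = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) {
        add_u32(x.data(), y.data(), n);
    }
    const auto t1 = std::chrono::steady_clock::now();

    // Read the result back so the compiler cannot drop the timed work.
    unsigned checksum = 0;
    for (std::size_t i = 0; i < n; ++i) checksum += x[i];
    volatile unsigned sink = checksum;
    (void)sink;

    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / (static_cast<double>(n) * passes);
}

// Sweeps 4 KiB .. 256 KiB per array, the range suggested above.
void sweep() {
    for (std::size_t kib = 4; kib <= 256; kib *= 2) {
        const std::size_t bytes = kib * 1024;
        const int passes = static_cast<int>((64u * 1024u * 1024u) / bytes);
        std::printf("%4zu KiB per array: %.3f ns/element\n",
                    kib, ns_per_element(bytes, passes));
    }
}
```

Call `sweep()` from `main`. If the explanation above holds, the cost per element should stay low while both arrays fit in cache and rise once they no longer do; where exactly that happens depends on the cache sizes of the particular chip.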

Tags:

c++

simd

arm

neon

cortex-a8

Smalti
