 

ARM GCC bug? Uses chains of vldr instead of one vldmia…

Consider the following NEON-optimized function:

void mat44_multiply_neon(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b) {
    // Make sure "a" is mapped to registers in the d0-d15 range,
    // as requested by NEON multiply operations below:
    register float32x4_t a0 asm("q0") = a.val[0];
    register float32x4_t a1 asm("q1") = a.val[1];
    register float32x4_t a2 asm("q2") = a.val[2];
    register float32x4_t a3 asm("q3") = a.val[3];
    asm volatile (
    "\n\t# multiply two matrices...\n\t"
    "# result (%q0,%q1,%q2,%q3)  = first column of B (%q4) * first row of A (q0-q3)\n\t"
    "vmul.f32 %q0, %q4, %e8[0]\n\t"
    "vmul.f32 %q1, %q4, %e9[0]\n\t"
    "vmul.f32 %q2, %q4, %e10[0]\n\t"
    "vmul.f32 %q3, %q4, %e11[0]\n\t"
    "# result (%q0,%q1,%q2,%q3) += second column of B (%q5) * second row of A (q0-q3)\n\t"
    "vmla.f32 %q0, %q5, %e8[1]\n\t"
    "vmla.f32 %q1, %q5, %e9[1]\n\t"
    "vmla.f32 %q2, %q5, %e10[1]\n\t"
    "vmla.f32 %q3, %q5, %e11[1]\n\t"
    "# result (%q0,%q1,%q2,%q3) += third column of B (%q6) * third row of A (q0-q3)\n\t"
    "vmla.f32 %q0, %q6, %f8[0]\n\t"
    "vmla.f32 %q1, %q6, %f9[0]\n\t"
    "vmla.f32 %q2, %q6, %f10[0]\n\t"
    "vmla.f32 %q3, %q6, %f11[0]\n\t"
    "# result (%q0,%q1,%q2,%q3) += last column of B (%q7) * last row of A (q0-q3)\n\t"
    "vmla.f32 %q0, %q7, %f8[1]\n\t"
    "vmla.f32 %q1, %q7, %f9[1]\n\t"
    "vmla.f32 %q2, %q7, %f10[1]\n\t"
    "vmla.f32 %q3, %q7, %f11[1]\n\t\n\t"
    : "=&w"  (result.val[0]), "=&w"  (result.val[1]), "=&w"  (result.val[2]), "=&w" (result.val[3])
    : "w"   (b.val[0]),      "w"   (b.val[1]),      "w"   (b.val[2]),      "w"   (b.val[3]),
      "w"   (a0),            "w"   (a1),            "w"   (a2),            "w"   (a3)
    :
    );
}

Why does GCC 4.5 generate this abomination, for loading the first matrix:

vldmia  r1, {d0-d1}
vldr    d2, [r1, #16]
vldr    d3, [r1, #24]
vldr    d4, [r1, #32]
vldr    d5, [r1, #40]
vldr    d6, [r1, #48]
vldr    d7, [r1, #56]

…instead of just:

vldmia  r1, {q0-q3}

…?

Options I use:

arm-none-eabi-gcc-4.5.1 -x c++ -march=armv7-a -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -O3 -ffast-math -fgcse-las -funsafe-loop-optimizations -fsee -fomit-frame-pointer -fstrict-aliasing -ftree-vectorize

Note that using the iPhoneOS-provided compiler produces the same thing:

/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc-4.2 -x c++ -arch armv7 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -O3 -ffast-math -fgcse-las -funsafe-loop-optimizations -fsee -fomit-frame-pointer -fstrict-aliasing -ftree-vectorize
asked Dec 24 '10 by jcayzac


3 Answers

Simple answer:

The GCC compiler is currently not very good at generating ARM code. If you look closely at other generated code, you'll find that GCC almost never arranges registers so that it can use multiple-register loads/stores, except in hard-coded places such as the function prolog/epilog and inline memcpy.

When it comes to NEON instructions the code becomes even worse. This has to do with the way the NEON register file works: you can treat the same storage either as quad (q) registers or as pairs of double (d) registers. This kind of register aliasing is (as far as I know) unique among the architectures GCC supports, so the code generator does not produce optimal code in all cases.
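To illustrate the aliasing the allocator has to model (register choice is arbitrary here):

```asm
@ d0 and d1 are the low and high halves of q0 -- the same
@ storage under two names, so writing d1 also changes q0.
vmov.f32 q0, #1.0      @ q0 = {1.0, 1.0, 1.0, 1.0}
vmov.f32 d1, #2.0      @ q0 is now {1.0, 1.0, 2.0, 2.0}
```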

Btw, while I'm at it: GCC also has no idea that using the 'free' barrel shifter on the Cortex-A8 has an important impact on scheduling, and it gets that mostly wrong as well.

answered Nov 19 '22 by Nils Pipenbrinck


PPC has similar instructions (lmw and stmw), and on some implementations they are actually slower to execute than the equivalent series of individual loads/stores. Obviously you might trade that off against instruction-cache footprint or other considerations. You should benchmark on your target ARM platform to see whether gcc is really "wrong".

answered Nov 19 '22 by Ben Jackson


This doesn't apply to the snippet you gave, but in real NEON code splitting a load into 128-bit (or perhaps 256-bit) vld1 blocks can lead to better-performing code. That's because NEON loads, stores and permutes can dual-issue with other NEON instructions, but dual-issue can only happen on the first or last cycle of a multi-cycle instruction. With aligned data you get a 128-bit load in 1 cycle and a 256-bit load in 2 cycles.
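For example, the 64-byte matrix load in question could be split into two 256-bit pieces like this (a sketch; register choice and addressing are illustrative):

```asm
@ each vld1 moves 256 bits and can dual-issue on its
@ first or last cycle with neighbouring NEON instructions
vld1.32 {d0-d3}, [r1]!   @ q0, q1
vld1.32 {d4-d7}, [r1]    @ q2, q3
```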

answered Nov 19 '22 by Exophase