
NEON float multiplication is slower than expected

Tags: c++, gcc, simd, arm, neon

I have two arrays ("tabs") of floats. I need to multiply each element of the first array by the corresponding element of the second and store the result in a third array.

I would like to use NEON to parallelize the float multiplications: four multiplications at once instead of one.

I expected a significant speed-up, but I only achieved about a 20% reduction in execution time. This is my code:

#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>

const int n = 100; // table size

/* fill a tab with random floats */
void rand_tab(float *t) {
    for (int i = 0; i < n; i++)
        t[i] = (float)rand()/(float)RAND_MAX;
}

/* Multiply elements of two tabs and store results in third tab
 - STANDARD processing. */
void mul_tab_standard(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i++)
        tr[i] = t1[i] * t2[i];
}

/* Multiply elements of two tabs and store results in third tab 
- NEON processing. */
void mul_tab_neon(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i+=4)
        vst1q_f32(tr+i, vmulq_f32(vld1q_f32(t1+i), vld1q_f32(t2+i)));
}

int main() {
    float t1[n], t2[n], tr[n];

    /* fill tables with random values */
    srand(1); rand_tab(t1); rand_tab(t2);


    // Repeat the table multiplication 1000000 times for measuring purposes:
    for (int k=0; k < 1000000; k++)
        mul_tab_standard(t1, t2, tr);  // switch to next line for comparison:
        //mul_tab_neon(t1, t2, tr);
    return 0;
}

I run the following command to compile: g++ -mfpu=neon -ffast-math neon_test.cpp

My CPU: ARMv7 Processor rev 0 (v7l)

Do you have any ideas how I can achieve more significant speed-up?

tomto asked Sep 14 '12 07:09



2 Answers

Cortex-A8 and Cortex-A9 can do only two single-precision floating-point multiplications per cycle, so at most you can double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll loops as much as possible. If you want ultimate performance, write in assembly: gcc's code generator for ARM is nowhere near as good as the one for x86.
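As a rough illustration of the unrolling advice, here is a minimal sketch that processes two 4-float vectors per iteration. The function name mul_tab_neon_unrolled and the explicit length parameter n are hypothetical (not from the question), the output array is assumed not to overlap the inputs, and a scalar fallback is included only so the sketch also builds on targets without NEON. Whether a factor of two is the best unroll depth for a given core is something to measure, not assume.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: element-wise product of t1 and t2, written to tr.
// Assumes tr does not overlap t1 or t2.
void mul_tab_neon_unrolled(const float *t1, const float *t2, float *tr,
                           std::size_t n) {
    assert(t1 && t2 && tr);
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Main loop: 8 floats (two Q registers per input) per iteration.
    for (; i + 8 <= n; i += 8) {
        // Issue all four loads before the multiplies so the two
        // multiply/store chains are independent of each other.
        float32x4_t a0 = vld1q_f32(t1 + i);
        float32x4_t a1 = vld1q_f32(t1 + i + 4);
        float32x4_t b0 = vld1q_f32(t2 + i);
        float32x4_t b1 = vld1q_f32(t2 + i + 4);
        vst1q_f32(tr + i,     vmulq_f32(a0, b0));
        vst1q_f32(tr + i + 4, vmulq_f32(a1, b1));
    }
    // At most one remaining group of 4 floats.
    for (; i + 4 <= n; i += 4)
        vst1q_f32(tr + i, vmulq_f32(vld1q_f32(t1 + i), vld1q_f32(t2 + i)));
#endif
    // Scalar tail (and the whole loop on non-NEON targets).
    for (; i < n; ++i)
        tr[i] = t1[i] * t2[i];
}
```

With the question's n = 100, the main loop handles the first 96 elements and the 4-wide loop handles the last 4.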

I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8 and Cortex-A5, change -mcpu and -mtune to cortex-a15, cortex-a8 or cortex-a5 accordingly. gcc does not have optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and unroll even more than you usually do), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of gcc that supports them).

Marat Dukhan answered Nov 07 '22 06:11


One shortcoming of NEON intrinsics is that you can't use auto-increment addressing on loads, which shows up as extra instructions in your NEON implementation.

Compiled with gcc version 4.4.3 and options -c -std=c99 -mfpu=neon -O3 and dumped with objdump, this is the loop part of mul_tab_neon:

000000a4 <mul_tab_neon>:
  ac:   e0805003    add r5, r0, r3
  b0:   e0814003    add r4, r1, r3
  b4:   e082c003    add ip, r2, r3
  b8:   e2833010    add r3, r3, #16
  bc:   f4650a8f    vld1.32 {d16-d17}, [r5]
  c0:   f4642a8f    vld1.32 {d18-d19}, [r4]
  c4:   e3530e19    cmp r3, #400    ; 0x190
  c8:   f3400df2    vmul.f32    q8, q8, q9
  cc:   f44c0a8f    vst1.32 {d16-d17}, [ip]
  d0:   1afffff5    bne ac <mul_tab_neon+0x8>

and this is the loop part of mul_tab_standard:

00000000 <mul_tab_standard>:
  58:   ecf01b02    vldmia  r0!, {d17}
  5c:   ecf10b02    vldmia  r1!, {d16}
  60:   f3410db0    vmul.f32    d16, d17, d16
  64:   ece20b02    vstmia  r2!, {d16}
  68:   e1520003    cmp r2, r3
  6c:   1afffff9    bne 58 <mul_tab_standard+0x58>

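As you can see, in the standard case the compiler creates a much tighter loop: the loads and the store use post-increment addressing (r0!, r1!, r2!), so no separate add instructions are needed. Note also that this loop is already vectorized, since vmul.f32 on D registers multiplies two floats per iteration.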
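One thing that may be worth trying is to walk the arrays with pointers instead of a shared index, so that each load and store is followed directly by an increment of its own pointer. This is only a sketch: the name mul_tab_neon_ptr and the length parameter n are hypothetical, n is assumed to be a multiple of 4, and whether gcc then emits the post-increment (writeback) forms of vld1/vst1 depends on the compiler version and flags, so check the objdump output rather than relying on it. The scalar branch exists only so the sketch builds on non-NEON targets.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical variant of mul_tab_neon that advances three pointers
// instead of indexing with a common offset.
// Assumes n is a multiple of 4 and tr does not overlap t1 or t2.
void mul_tab_neon_ptr(const float *t1, const float *t2, float *tr,
                      std::size_t n) {
    assert(n % 4 == 0);
    const float *end = t1 + n;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    while (t1 != end) {
        float32x4_t a = vld1q_f32(t1);   // load 4 floats from the first array
        t1 += 4;
        float32x4_t b = vld1q_f32(t2);   // load 4 floats from the second array
        t2 += 4;
        vst1q_f32(tr, vmulq_f32(a, b));  // multiply lane-wise and store
        tr += 4;
    }
#else
    // Scalar fallback for targets without NEON.
    while (t1 != end)
        *tr++ = *t1++ * *t2++;
#endif
}
```

Comparing the disassembly of this version with the listing of mul_tab_neon above shows whether the three add instructions per iteration have gone away on your toolchain.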

auselen answered Nov 07 '22 05:11