I have two arrays of floats. I need to multiply elements from the first array by the corresponding elements from the second array and store the result in a third array.
I would like to use NEON to parallelize the float multiplications: four multiplications at a time instead of one.
I expected a significant acceleration but I achieved only about a 20% reduction in execution time. This is my code:
#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>

const int n = 100; // array size

/* fill an array with random floats */
void rand_tab(float *t) {
    for (int i = 0; i < n; i++)
        t[i] = (float)rand() / (float)RAND_MAX;
}

/* Multiply elements of two arrays and store the results in a third array
   - STANDARD processing. */
void mul_tab_standard(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i++)
        tr[i] = t1[i] * t2[i];
}

/* Multiply elements of two arrays and store the results in a third array
   - NEON processing. */
void mul_tab_neon(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i += 4)
        vst1q_f32(tr + i, vmulq_f32(vld1q_f32(t1 + i), vld1q_f32(t2 + i)));
}

int main() {
    float t1[n], t2[n], tr[n];
    /* fill arrays with random values */
    srand(1);
    rand_tab(t1);
    rand_tab(t2);
    // I repeat the array multiplication 1000000 times for measuring purposes:
    for (int k = 0; k < 1000000; k++)
        mul_tab_standard(t1, t2, tr); // switch to the next line for comparison:
        //mul_tab_neon(t1, t2, tr);
    return 0;
}
I run the following command to compile: g++ -mfpu=neon -ffast-math neon_test.cpp
My CPU: ARMv7 Processor rev 0 (v7l)
Do you have any ideas how I can achieve a more significant speed-up?
Cortex-A8 and Cortex-A9 can do only two single-precision FP multiplications per cycle, so at best you may double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll the loop as much as possible. If you want ultimate performance, write in assembly: gcc's code generator for ARM is nowhere near as good as it is for x86.
I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8 and Cortex-A5, replace the -mcpu and -mtune values with cortex-a15/a8/a5 accordingly. gcc does not have optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and also unroll even more than you usually would), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of gcc that supports it).
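For illustration, an unrolled variant of the NEON loop might look like the sketch below. This is only a sketch: mul_tab_neon_unrolled is a hypothetical name, the element count is passed as a parameter instead of the global n from the question, and it assumes the count is a multiple of 16.

#include <arm_neon.h>

/* Sketch: the same multiplication unrolled 4x, i.e. 16 floats per iteration.
   The four independent multiplies per iteration give the pipeline more work
   to overlap. Assumes count is a multiple of 16. */
void mul_tab_neon_unrolled(const float *t1, const float *t2, float *tr, int count) {
    for (int i = 0; i < count; i += 16) {
        vst1q_f32(tr + i,      vmulq_f32(vld1q_f32(t1 + i),      vld1q_f32(t2 + i)));
        vst1q_f32(tr + i + 4,  vmulq_f32(vld1q_f32(t1 + i + 4),  vld1q_f32(t2 + i + 4)));
        vst1q_f32(tr + i + 8,  vmulq_f32(vld1q_f32(t1 + i + 8),  vld1q_f32(t2 + i + 8)));
        vst1q_f32(tr + i + 12, vmulq_f32(vld1q_f32(t1 + i + 12), vld1q_f32(t2 + i + 12)));
    }
}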
One shortcoming of NEON intrinsics is that you can't use post-increment (auto-increment) addressing on loads, which shows up as extra address-computation instructions in your NEON implementation.
Compiled with gcc version 4.4.3 and options -c -std=c99 -mfpu=neon -O3, and dumped with objdump, this is the loop part of mul_tab_neon:
000000a4 <mul_tab_neon>:
ac: e0805003 add r5, r0, r3
b0: e0814003 add r4, r1, r3
b4: e082c003 add ip, r2, r3
b8: e2833010 add r3, r3, #16
bc: f4650a8f vld1.32 {d16-d17}, [r5]
c0: f4642a8f vld1.32 {d18-d19}, [r4]
c4: e3530e19 cmp r3, #400 ; 0x190
c8: f3400df2 vmul.f32 q8, q8, q9
cc: f44c0a8f vst1.32 {d16-d17}, [ip]
d0: 1afffff5 bne ac <mul_tab_neon+0x8>
and this is the loop part of mul_tab_standard:
00000000 <mul_tab_standard>:
58: ecf01b02 vldmia r0!, {d17}
5c: ecf10b02 vldmia r1!, {d16}
60: f3410db0 vmul.f32 d16, d17, d16
64: ece20b02 vstmia r2!, {d16}
68: e1520003 cmp r2, r3
6c: 1afffff9 bne 58 <mul_tab_standard+0x58>
As you can see, in the standard case the compiler creates a much tighter loop, using post-incremented vldmia/vstmia loads and stores instead of separate address calculations.
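A common workaround is to step raw pointers through the arrays instead of indexing, which at least gives the compiler a chance to fold the pointer updates into post-increment addressing. A minimal sketch, assuming the same layout as your code (mul_tab_neon_ptr is a hypothetical name, and whether gcc actually emits post-increment loads here depends on the compiler version):

#include <arm_neon.h>

/* Sketch: pointer-stepping variant of the intrinsic loop.
   Assumes count is a multiple of 4. */
void mul_tab_neon_ptr(const float *t1, const float *t2, float *tr, int count) {
    for (int i = 0; i < count; i += 4) {
        float32x4_t a = vld1q_f32(t1); t1 += 4;   // load 4 floats from each input
        float32x4_t b = vld1q_f32(t2); t2 += 4;
        vst1q_f32(tr, vmulq_f32(a, b)); tr += 4;  // multiply and store 4 results
    }
}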