I was wondering how ARM floating-point performance on smartphones compares to x86. To find out, I wrote the following code:
#include "Linderdaum.h"
sEnvironment* Env = NULL;
volatile float af = 1.0f;
volatile float bf = 1.0f;
volatile int a = 1;
volatile int b = 1;
APPLICATION_ENTRY_POINT
{
    Env = new sEnvironment();
    Env->DeployDefaultEnvironment( "", "CommonMedia" );
    double Start = Env->GetSeconds();
    float Sum1 = 0.0f;
    for ( int i = 0; i != 200000000; i++ ) { Sum1 += af + bf; }
    double End = Env->GetSeconds();
    Env->Logger->Log( L_DEBUG, LStr::ToStr( Sum1, 4 ) );
    Env->Logger->Log( L_DEBUG, "Float: " + LStr::ToStr( End-Start, 5 ) );
    Start = Env->GetSeconds();
    int Sum2 = 0;
    for ( int i = 0; i != 200000000; i++ ) { Sum2 += a + b; }
    End = Env->GetSeconds();
    Env->Logger->Log( L_DEBUG, LStr::ToStr( Sum2, 4 ) );
    Env->Logger->Log( L_DEBUG, "Int: " + LStr::ToStr( End-Start, 5 ) );
    Env->RequestExit();
    APPLICATION_EXIT_POINT( Env );
}
APPLICATION_SHUTDOWN
{}
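If you don't have the Linderdaum engine at hand, a roughly equivalent standalone version using std::chrono (C++11; my own reconstruction of the same benchmark, not part of the engine code) should look something like this:

#include <chrono>
#include <cstdio>

volatile float af = 1.0f;
volatile float bf = 1.0f;
volatile int a = 1;
volatile int b = 1;

int main()
{
    using Clock = std::chrono::steady_clock;

    // Float addition loop, same iteration count as above.
    auto t0 = Clock::now();
    float Sum1 = 0.0f;
    for ( int i = 0; i != 200000000; i++ ) { Sum1 += af + bf; }
    auto t1 = Clock::now();

    // Integer addition loop.
    int Sum2 = 0;
    for ( int i = 0; i != 200000000; i++ ) { Sum2 += a + b; }
    auto t2 = Clock::now();

    std::printf( "%f\nFloat: %.5f\n", Sum1, std::chrono::duration<double>( t1 - t0 ).count() );
    std::printf( "%d\nInt: %.5f\n",   Sum2, std::chrono::duration<double>( t2 - t1 ).count() );
    return 0;
}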
Here are the results for different targets and compilers.
1. Windows PC on Core i7 920.
VS 2008, debug build, Win32/x86
(Main):01:30:11.769 Float: 0.72119
(Main):01:30:12.347 Int: 0.57875
float is slower than int.
VS 2008, debug build, Win64/x86-64
(Main):01:43:39.468 Float: 0.72247
(Main):01:43:40.040 Int: 0.57212
VS 2008, release build, Win64/x86-64
(Main):01:39:25.844 Float: 0.21671
(Main):01:39:26.060 Int: 0.21511
VS 2008, release build, Win32/x86
(Main):01:33:27.603 Float: 0.70670
(Main):01:33:27.814 Int: 0.21130
int is gaining the lead.
2. Samsung Galaxy S smartphone.
GCC 4.3.4, armeabi-v7a, -mfpu=vfp -mfloat-abi=softfp -O3
01-27 01:31:01.171 I/LEngine (15364): (Main):01:31:01.177 Float: 6.47994
01-27 01:31:02.257 I/LEngine (15364): (Main):01:31:02.262 Int: 1.08442
float is seriously slower than int.
Let's now change addition to multiplication inside the loops:
float Sum1 = 2.0f;
for ( int i = 0; i != 200000000; i++ )
{
    Sum1 *= af * bf;
}
...
int Sum2 = 2;
for ( int i = 0; i != 200000000; i++ )
{
    Sum2 *= a * b;
}
VS 2008, debug build, Win32/x86
(Main):02:00:39.977 Float: 0.87484
(Main):02:00:40.559 Int: 0.58221
VS 2008, debug build, Win64/x86-64
(Main):01:59:27.175 Float: 0.77970
(Main):01:59:27.739 Int: 0.56328
VS 2008, release build, Win32/x86
(Main):02:05:10.413 Float: 0.86724
(Main):02:05:10.631 Int: 0.21741
VS 2008, release build, Win64/x86-64
(Main):02:09:58.355 Float: 0.29311
(Main):02:09:58.571 Int: 0.21595
GCC 4.3.4, armeabi-v7a, -mfpu=vfp -mfloat-abi=softfp -O3
01-27 02:02:20.152 I/LEngine (15809): (Main):02:02:20.156 Float: 6.97402
01-27 02:02:22.765 I/LEngine (15809): (Main):02:02:22.769 Int: 2.61264
The question is: what am I missing (any compiler options)? Is floating-point math really slower (compared to int) on ARM devices?
-mfloat-abi=softfp explicitly calls for emulated floating point. Check the specs of your Galaxy, and compile with hardware FP if possible.
Not all ARM CPUs support hardware floating point to begin with. The default settings of the NDK's armeabi target call for emulated FP, though, since it is supposed to be compatible with FP-less machines. At best, you can do some run-time branching on CPU capabilities, along the lines of the sketch below.
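As a rough sketch of that run-time check, the NDK ships a small cpufeatures module (sources/android/cpufeatures) you can link against; the helper below only reports whether hardware FP is available, and what you do with that answer is up to your code:

#include <stdint.h>
#include <cpu-features.h>   // NDK cpufeatures module

// Returns true if the device advertises VFPv3 and/or NEON, i.e.
// single-precision floating point can run in hardware.
static bool HasHardwareFloat()
{
    if ( android_getCpuFamily() != ANDROID_CPU_FAMILY_ARM ) { return false; }

    uint64_t Features = android_getCpuFeatures();

    return ( Features & ( ANDROID_CPU_ARM_FEATURE_VFPv3 |
                          ANDROID_CPU_ARM_FEATURE_NEON ) ) != 0;
}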
These results are believable.
The Cortex-A8 core used in the Exynos 3 SoC has an unpipelined VFP implementation. I don't remember the exact numbers off the top of my head, but my recollection is that throughput for VFP add and multiply is on the order of an op every 8 cycles on that core.
The good news: that's a really old SoC, and newer ARM SoCs have stronger VFP implementations; add, sub, and multiply are fully pipelined, and throughput is much improved. Also, some (but not all) Cortex-A8 SoCs support NEON, which gives you fully pipelined single-precision floating point.
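For example, on a NEON-capable part the float loop from the question could be rewritten with intrinsics. This is only a sketch of the idea, not a drop-in replacement (it keeps four partial sums instead of one serial chain, so the additions happen in a different order than in the original loop), and it needs -mfpu=neon:

#include <arm_neon.h>

// Adds (af + bf) into four partial sums per iteration; Count is assumed
// to be a multiple of 4.
float SumWithNeon( float af, float bf, int Count )
{
    float32x4_t Sum  = vdupq_n_f32( 0.0f );
    float32x4_t Step = vdupq_n_f32( af + bf );

    for ( int i = 0; i != Count; i += 4 ) { Sum = vaddq_f32( Sum, Step ); }

    // Horizontal add of the four lanes.
    return vgetq_lane_f32( Sum, 0 ) + vgetq_lane_f32( Sum, 1 ) +
           vgetq_lane_f32( Sum, 2 ) + vgetq_lane_f32( Sum, 3 );
}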
@Seva Alekseyev The -mfloat-abi flag only controls how floating-point values are passed to functions. With softfp, values are passed in normal core registers; with hardfp, values are passed in FPU registers. The -mfloat-abi flag doesn't control which hardware instructions are used. Basically, softfp is used to maintain backwards compatibility with devices that do not have an FPU. Using softfp will result in some extra overhead on devices with an FPU.
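To make the distinction concrete, here is a trivial function with the usual register behaviour spelled out in comments (this is my summary of the AAPCS conventions, not output from any particular compiler):

// With -mfloat-abi=softfp: a and b arrive in core registers r0/r1, are
// moved into VFP registers, the VADD.F32 runs in hardware, and the result
// is moved back to r0 for the return.
// With -mfloat-abi=hard: a and b arrive directly in s0/s1 and the result
// is returned in s0, with no extra moves at the call site.
// With -mfloat-abi=soft: no VFP instructions at all; the add becomes a
// call to the __aeabi_fadd library routine.
float AddFloats( float a, float b )
{
    return a + b;
}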
@Sergey K Comparing x86 and ARM is like comparing apples to oranges; they are two very different platforms. The primary design goal for ARM is low power, not speed. You could see some performance improvement using hardfp. There is also a 4.6 version of the compiler available. I think your results are plausible considering the architecture differences.
See http://github.com/dwelch67/stm32f4d, in particular the float03 directory. The test compares these two functions, fixed point vs. float:
.thumb_func
.globl add
add:
    mov r3,#0
loop:
    add r3,r0,r1        @ integer add: r3 = r0 + r1
    sub r2,#1           @ decrement loop counter
    bne loop
    mov r0,r3           @ return the last sum in r0
    bx lr

.thumb_func
.globl m4add
m4add:
    vmov s0,r0          @ move the operands into VFP registers
    vmov s1,r1
m4loop:
    vadd.f32 s2,s0,s1   @ single-precision hardware add
    sub r2,#1           @ decrement loop counter
    bne m4loop
    vmov r0,s2          @ move the result back to a core register
    bx lr
The results are not too surprising: the 0x4E2C time is fixed point and 0x4E2E is float. The few extra instructions in the float test function likely account for the difference:
00004E2C
00004E2C
00004E2E
00004E2E
00004E2C
00004E2E
The FPU in the STM32F4 is a single-precision-only version of the VFP found in its bigger siblings. You should be able to perform the above test on any ARMv7 with VFP hardware.
Having the __aeabi_fadd function linked in and called each time through the loop, plus the additional memory accesses and possible conversions outside or inside (vmov) the library function, can add to what you are seeing. The answer, of course, is in the disassembly.
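For completeness, with the softfp calling convention the two routines above can be declared and driven from C++ roughly like this; the prototypes are my reading of the register usage (operands in r0/r1, count in r2, result in r0), not something taken from the repository:

// Assembly routines from above; the iteration count must be non-zero.
extern "C" unsigned int add  ( unsigned int a, unsigned int b, unsigned int n );
extern "C" float        m4add( float a, float b, unsigned int n );

void RunBoth()
{
    // 0x100000 is an arbitrary iteration count chosen for illustration.
    volatile unsigned int FixedResult = add  ( 1,    1,    0x100000 );
    volatile float        FloatResult = m4add( 1.0f, 1.0f, 0x100000 );
    (void)FixedResult;
    (void)FloatResult;
}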