
Does float VS int performance on x86 and ARM differ so much?

I was wondering how ARM floating-point performance on smartphones compares to x86. To find out, I wrote the following code:

#include "Linderdaum.h"
sEnvironment* Env = NULL;

volatile float af = 1.0f;
volatile float bf = 1.0f;
volatile int a = 1;
volatile int b = 1;

APPLICATION_ENTRY_POINT
{
    Env = new sEnvironment();

    Env->DeployDefaultEnvironment( "", "CommonMedia" );

    double Start = Env->GetSeconds();

    float Sum1 = 0.0f;

    for ( int i = 0; i != 200000000; i++ )
    {
        Sum1 += af + bf;
    }

    double End = Env->GetSeconds();

    Env->Logger->Log( L_DEBUG, LStr::ToStr( Sum1, 4 ) );
    Env->Logger->Log( L_DEBUG, "Float: " + LStr::ToStr( End-Start, 5 ) );

    Start = Env->GetSeconds();

    int Sum2 = 0;

    for ( int i = 0; i != 200000000; i++ )
    {
        Sum2 += a + b;
    }

    End = Env->GetSeconds();

    Env->Logger->Log( L_DEBUG, LStr::ToStr( Sum2, 4 ) );
    Env->Logger->Log( L_DEBUG, "Int: " + LStr::ToStr( End-Start, 5 ) );

    Env->RequestExit();

    APPLICATION_EXIT_POINT( Env );
}

APPLICATION_SHUTDOWN
{}

Here are the results for different targets and compilers.

1. Windows PC on Core i7 920.

VS 2008, debug build, Win32/x86

(Main):01:30:11.769   Float: 0.72119
(Main):01:30:12.347   Int: 0.57875

float is slower than int.

VS 2008, debug build, Win64/x86-64

(Main):01:43:39.468   Float: 0.72247
(Main):01:43:40.040   Int: 0.57212

VS 2008, release build, Win64/x86-64

(Main):01:39:25.844   Float: 0.21671
(Main):01:39:26.060   Int: 0.21511

VS 2008, release build, Win32/x86

(Main):01:33:27.603   Float: 0.70670
(Main):01:33:27.814   Int: 0.21130

int is gaining the lead.

2. Samsung Galaxy S smartphone.

GCC 4.3.4, armeabi-v7a, -mfpu=vfp -mfloat-abi=softfp -O3

01-27 01:31:01.171 I/LEngine (15364): (Main):01:31:01.177   Float: 6.47994
01-27 01:31:02.257 I/LEngine (15364): (Main):01:31:02.262   Int: 1.08442

float is seriously slower than int.

Let's now change addition to multiplication inside the loops:

float Sum1 = 2.0f;

for ( int i = 0; i != 200000000; i++ )
{
    Sum1 *= af * bf;
}
...
int Sum2 = 2;

for ( int i = 0; i != 200000000; i++ )
{
    Sum2 *= a * b;
}

VS 2008, debug build, Win32/x86

(Main):02:00:39.977   Float: 0.87484
(Main):02:00:40.559   Int: 0.58221

VS 2008, debug build, Win64/x86-64

(Main):01:59:27.175   Float: 0.77970
(Main):01:59:27.739   Int: 0.56328

VS 2008, release build, Win32/x86

(Main):02:05:10.413   Float: 0.86724
(Main):02:05:10.631   Int: 0.21741

VS 2008, release build, Win64/x86-64

(Main):02:09:58.355   Float: 0.29311
(Main):02:09:58.571   Int: 0.21595

GCC 4.3.4, armeabi-v7a, -mfpu=vfp -mfloat-abi=softfp -O3

01-27 02:02:20.152 I/LEngine (15809): (Main):02:02:20.156   Float: 6.97402
01-27 02:02:22.765 I/LEngine (15809): (Main):02:02:22.769   Int: 2.61264

The question is: what am I missing (any compiler options)? Is floating-point math really that much slower than integer math on ARM devices?

Sergey K. asked Sep 10 '12 13:09

4 Answers

-mfloat-abi=softfp explicitly calls for emulated floating point. Check the specs of your Galaxy, and compile with hardware FP if possible.

Not all ARM CPUs support hardware floating point to begin with. The NDK's default ARMEABI settings call for emulated FP, though, since they are supposed to be compatible with FP-less machines. At best, you can do some run-time branching on CPU capabilities.
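For the NDK toolchains of that era, this came down to build settings; a hedged sketch of the relevant Android.mk options (exact behavior depends on your NDK version):

```make
# armeabi (the default): software floating point, runs on FP-less devices
APP_ABI := armeabi

# armeabi-v7a: real VFP instructions; FP arguments still passed in core
# registers (this corresponds to -mfpu=vfp -mfloat-abi=softfp)
APP_ABI := armeabi-v7a

# opt into NEON for modules that need it (adds -mfpu=neon); only safe on
# devices whose CPU actually has NEON, so guard it at run time
LOCAL_ARM_NEON := true
```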

Seva Alekseyev answered Oct 03 '22 00:10


These results are believable.

The Cortex-A8 core used in the Exynos 3 SoC has an unpipelined VFP implementation. I don't remember the exact numbers off the top of my head, but my recollection is that throughput for VFP add and multiply is on the order of an op every 8 cycles on that core.

The good news: that's a really old SoC, and newer ARM SoCs have stronger VFP implementations - add, sub, and multiply are fully pipelined, and throughput is much improved. Also, some (but not all) Cortex-A8 SoCs support NEON, which gives you fully pipelined single-precision floating point.

Stephen Canon answered Oct 03 '22 01:10


@Seva Alekseyev: The -mfloat-abi flag only controls how floating-point values are passed to functions. With softfp, values are passed in normal core registers; with hardfp, they are passed in FPU registers. The -mfloat-abi flag doesn't control which hardware instructions are used.

Basically, softfp is used to maintain backwards compatibility with devices that do not have an FPU. Using softfp will result in some extra overhead on devices that do have one.

@Sergey K: Comparing x86 and ARM is like comparing apples to oranges. They are two very different platforms. The primary design goal for ARM is low power, not speed. You could see some performance improvement using hardfp, and there is also a 4.6 version of the compiler available. I think your results are plausible considering the architectural differences.

Frohnzie answered Oct 03 '22 00:10


See http://github.com/dwelch67/stm32f4d, in particular the float03 directory.

The test compares these two functions, fixed point vs. float:

.thumb_func
.globl add
add:
    mov r3,#0
loop:
    add r3,r0,r1    @ integer add
    subs r2,#1      @ decrement loop counter, setting flags for bne
    bne loop
    mov r0,r3
    bx lr

.thumb_func
.globl m4add
m4add:
    vmov s0,r0      @ move operands into FPU registers
    vmov s1,r1
m4loop:
    vadd.f32 s2,s0,s1
    subs r2,#1      @ decrement loop counter, setting flags for bne
    bne m4loop
    vmov r0,s2
    bx lr

The results are not too surprising. The 0x4E2C time is fixed point and 0x4E2E is float; there are a few extra instructions in the float test function that likely account for the difference:

    fixed (add):    0x4E2C  0x4E2C  0x4E2C
    float (m4add):  0x4E2E  0x4E2E  0x4E2E

The FPU in the STM32F4 is a single-precision-only version of the VFP found in its bigger brothers and sisters. You should be able to perform the above test on any ARMv7 with VFP hardware.

Having the __aeabi_fadd function linked in, with that extra call made each time through the loop, plus the additional memory accesses, possible conversions (vmov) outside or inside the library function, etc., can all add to what you are seeing. The answer, of course, is in the disassembly.

old_timer answered Oct 02 '22 23:10