 

Inline assembly vs math library

Hi, could anyone help me understand why it is more efficient to call math library functions than to write inline assembly code that performs the same operation? I wrote this simple test:

#include <stdio.h>
#define __USE_GNU
#include <math.h>

int main( void ){
    float ang;
    int i;

    for( i = 0; i < 1000000; i++ ){
        ang = M_PI_2 * i / 2000000;
        /* Variant 1: x87 inline assembly (commented out while timing variant 2)
        __asm__ ( "fld %0;"
                  "fptan;"
                  "fxch;"
                  "fstp %0;" : "=m" (ang) : "m" (ang)
        ); */
        /* Variant 2: call tanf from libm */
        ang = tanf(ang);
    }
    printf("Tan(ang): %f\n", ang);
    return 0;
}

That code computes the tangent of an angle in two different ways: one calling the tanf function from the dynamically linked math library (libm), and the other using inline assembly code. Note that I comment out one of the two variants alternately. The code performs the operation many times so that the results measured with the time command in a Linux terminal are meaningful.

The version that uses the math library takes around 0.040 s. The version that uses the assembly code takes around 0.440 s; ten times as long.

These are the disassembly results. Both versions were compiled with the -O3 option.

LIBM

  4005ad:   b8 db 0f c9 3f          mov    $0x3fc90fdb,%eax
  4005b2:   89 45 f8                mov    %eax,-0x8(%rbp)
  4005b5:   f3 0f 10 45 f8          movss  -0x8(%rbp),%xmm0
  4005ba:   e8 e1 fe ff ff          callq  4004a0 <tanf@plt>
  4005bf:   f3 0f 11 45 f8          movss  %xmm0,-0x8(%rbp)
  4005c4:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  4005c8:   83 7d fc 00             cmpl   $0x0,-0x4(%rbp)
  4005cc:   7e df                   jle    4005ad <main+0x19>

ASM

  40050d:   b8 db 0f c9 3f          mov    $0x3fc90fdb,%eax
  400512:   89 45 f8                mov    %eax,-0x8(%rbp)
  400515:   d9 45 f8                flds   -0x8(%rbp)
  400518:   d9 f2                   fptan  
  40051a:   d9 c9                   fxch   %st(1)
  40051c:   d9 5d f8                fstps  -0x8(%rbp)
  40051f:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  400523:   83 7d fc 00             cmpl   $0x0,-0x4(%rbp)
  400527:   7e e4                   jle    40050d <main+0x19>

Any idea? Thanks.

I think I have an idea. Browsing the glibc code, I found out that tanf is implemented through a polynomial approximation using the SSE extension. I guess that turns out to be faster than the microcode for the fptan instruction.
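Just to illustrate what I mean by a polynomial approximation (a rough sketch of a truncated series, not the actual glibc polynomial or its coefficients):

/* Rough sketch only: a truncated Taylor series for tan(x), usable for small |x|.
 * The real glibc tanf uses carefully chosen coefficients, range reduction and
 * SSE code; this just shows the "polynomial instead of fptan" idea. */
static float tanf_poly_sketch(float x)
{
    float x2 = x * x;
    /* tan(x) ~= x + x^3/3 + 2*x^5/15 + 17*x^7/315 */
    return x * (1.0f + x2 * (1.0f/3.0f
                 + x2 * (2.0f/15.0f
                 + x2 * (17.0f/315.0f))));
}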

asked by pcremades


1 Answer

There's a major difference in the implementation of those functions.

fptan is a legacy 8087 instruction that uses the x87 floating-point register stack. Even originally, 8087 instructions were microcoded: invoking fptan caused a predefined program to run inside the 8087 coprocessor, one that used the fundamental capabilities of the processor, such as floating-point addition and multiplication. Microcoding bypasses some stages of the "natural" pipeline, e.g. prefetch and decode, which speeds up the process.
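For reference, a minimal sketch of wrapping fptan in GNU inline assembly (x86/x86-64 with GCC assumed). fptan leaves tan(x) in st(1) and pushes 1.0 on top, so the extra 1.0 has to be popped to keep the x87 stack balanced:

/* Sketch only: a balanced fptan wrapper; the x87 stack is left as it was found. */
static float tanf_fptan(float x)
{
    float r;
    __asm__ volatile (
        "flds   %1      \n\t"   /* push x onto the x87 stack        */
        "fptan          \n\t"   /* st(0) = 1.0, st(1) = tan(x)      */
        "fstp   %%st(0) \n\t"   /* discard the 1.0 pushed by fptan  */
        "fstps  %0      \n\t"   /* pop tan(x) into r                */
        : "=m" (r)
        : "m" (x));
    return r;
}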

The algorithm selected for the trigonometric functions in the 8087 was CORDIC.

Even though microcoding made fptan faster than explicitly issuing each constituent instruction, this was not the end of floating-point processor development; we could rather say that 8087 development was over. In later processors, fptan most probably has to be implemented as-is, as an IP block that behaves identically to the original instruction, with some glue logic to produce bit-for-bit the same output as the original.

Later processors first recycled the FP stack for "MMX". Then a completely new set of registers was introduced (XMM), along with an instruction set (SSE) capable of executing basic floating-point operations in parallel. First of all, support for extended-precision (80-bit) floating point was dropped. Then again, 20+ years of Moore's law allowed a much higher transistor budget, enough to build e.g. 64x64-bit parallel multipliers that speed up multiplication throughput.
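To make that parallelism concrete, here is a minimal sketch (not glibc code) using SSE intrinsics; a single packed multiplication operates on four single-precision floats held in one XMM register:

#include <xmmintrin.h>   /* SSE intrinsics */

/* Sketch only: scale four floats with one packed SSE multiplication. */
static void scale4(float out[4], const float in[4], float factor)
{
    __m128 v = _mm_loadu_ps(in);            /* load 4 floats into an XMM register  */
    __m128 f = _mm_set1_ps(factor);         /* broadcast the scale factor          */
    _mm_storeu_ps(out, _mm_mul_ps(v, f));   /* 4 multiplications in one instruction */
}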

Other instructions have suffered too: loop was once faster than the sub ecx, 1; jnz combination, and aam is probably slower today than conditionally adding 10 to some nibble of eax. Those same 20+ years of Moore's law have also allowed millions of transistors to speed up the prefetch stage: on the 8086, every single byte in the instruction encoding cost one more cycle, whereas today several instructions execute within a single cycle, because the instructions have already been fetched from memory.
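If you want to measure that yourself, a hypothetical micro-benchmark pair could look like this (GCC inline asm, x86-64 assumed); timing both with a large n on a modern CPU should show the legacy loop instruction losing to the explicit pair:

/* Sketch only: empty countdown loops, one with LOOP, one with sub/jnz. */
static void spin_loop(unsigned long n)
{
    __asm__ volatile ("1: loop 1b" : "+c" (n));        /* LOOP decrements rcx and branches */
}

static void spin_sub_jnz(unsigned long n)
{
    __asm__ volatile ("1: sub $1, %0 \n\t"
                      "   jnz 1b"     : "+r" (n));     /* explicit decrement + branch */
}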

That being said, you can also test whether a single instruction such as aam is actually faster than implementing its contents with an equivalent set of simpler, optimized instructions. This is the benefit of a library: it can use the fptan instruction, but it does not have to, if the processor architecture offers a faster set of instructions, more parallelism, a faster algorithm, or all of those.
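As a sketch of that dispatch idea (the two callees here are hypothetical stand-ins; __builtin_cpu_supports is a real GCC built-in):

/* Sketch only: a library can pick the implementation at run time. */
float tanf_poly_sse(float x);   /* hypothetical: SSE polynomial path */
float tanf_fptan_x87(float x);  /* hypothetical: x87 fptan fallback  */

float my_tanf(float x)
{
    return __builtin_cpu_supports("sse2") ? tanf_poly_sse(x)
                                          : tanf_fptan_x87(x);
}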

answered by Aki Suihkonen