Hi, could anyone help me understand why calling math library functions is more efficient than writing inline assembly code that performs the same operation? I wrote this simple test:
#include <stdio.h>
#define __USE_GNU
#include <math.h>
int main( void ){
float ang;
int i;
for( i = 0; i< 1000000; i++){
ang = M_PI_2 * i/2000000;
/*__asm__ ( "flds %0;"  // load ang (single precision) onto the x87 stack
"fptan;"                // st(0) = tan(st(0)), then 1.0 is pushed on top
"fxch;"                 // swap so the tangent is back in st(0)
"fstps %0;" : "=m" (ang) : "m" (ang)  // store it; note the 1.0 is left on the stack
) ;*/
ang = tanf(ang);
}
printf("Tan(ang): %f\n", ang);
}
That code computes the tangent of an angle in 2 different ways: one calling the tanf function from the dynamically linked library libm.so, and the other using inline assembly code. Note that I comment out the two variants alternately. The code performs the operation many times to get meaningful results with the time command in a Linux terminal.
The version that uses the math library takes around 0.040 s. The version that uses assembly code takes around 0.440 s; roughly ten times more.
These are the results of the disassembly. Both versions have been compiled with the -O3 option.
LIBM
4005ad: b8 db 0f c9 3f mov $0x3fc90fdb,%eax
4005b2: 89 45 f8 mov %eax,-0x8(%rbp)
4005b5: f3 0f 10 45 f8 movss -0x8(%rbp),%xmm0
4005ba: e8 e1 fe ff ff callq 4004a0 <tanf@plt>
4005bf: f3 0f 11 45 f8 movss %xmm0,-0x8(%rbp)
4005c4: 83 45 fc 01 addl $0x1,-0x4(%rbp)
4005c8: 83 7d fc 00 cmpl $0x0,-0x4(%rbp)
4005cc: 7e df jle 4005ad <main+0x19>
ASM
40050d: b8 db 0f c9 3f mov $0x3fc90fdb,%eax
400512: 89 45 f8 mov %eax,-0x8(%rbp)
400515: d9 45 f8 flds -0x8(%rbp)
400518: d9 f2 fptan
40051a: d9 c9 fxch %st(1)
40051c: d9 5d f8 fstps -0x8(%rbp)
40051f: 83 45 fc 01 addl $0x1,-0x4(%rbp)
400523: 83 7d fc 00 cmpl $0x0,-0x4(%rbp)
400527: 7e e4 jle 40050d <main+0x19>
Any idea? Thanks.
I think I've got an idea. Browsing the glibc source I found out that tanf is implemented through a polynomial approximation and uses the SSE extensions. I guess that turns out to be faster than the microcode for the fptan instruction.
There's a major difference in the implementation of those functions.
fptan is a legacy 8087 instruction that uses the floating-point register stack. Even the original 8087 instructions were microcoded: invoking the fptan instruction ran a predefined program inside the 8087, which used the fundamental capabilities of the processor, such as floating-point addition or even multiplication. Microcoding bypasses some stages of the "natural" pipeline, e.g. prefetch and decode, and that speeds up the process.
The algorithm selected for trigonometric functions in the 8087 was CORDIC.
Even though microcoding made fptan faster than explicitly calling each instruction in sequence, this was not the end of floating-point processor development; we could rather say that 8087 development was over. In later processors fptan is most probably implemented as-is, as an IP block that behaves identically to the original instruction, with some glue logic to produce bit-by-bit the same output as the original.
Later processors first recycled the FP stack for "MMX". Then a completely new set of registers (XMM) was introduced, along with an instruction set (SSE) capable of executing basic floating-point operations in parallel. First of all, support for extended-precision (80-bit) floating point was dropped. Then again, 20+ years of Moore's law allowed a much higher transistor budget, enough to build e.g. 64x64-bit parallel multipliers that speed up multiplication throughput.
Other instructions have suffered too: loop was once faster than the sub ecx, 1; jnz combination. aam is probably slower today than conditionally adding 10 to some nibble of eax. These 20+ years of Moore's law have also allowed millions of transistors to speed up the prefetch stage: on the 8086 every single byte in the instruction encoding cost one more cycle, whereas today several instructions execute within a single cycle because the instructions have already been fetched from memory.
That being said, you can also test whether a single instruction such as aam is actually faster than implementing its contents with an equivalent set of simpler, optimized instructions. This is the benefit of a library: it can use the fptan instruction, but it doesn't need to if the processor architecture supports a faster set of instructions, more parallelism, a faster algorithm, or all of those.