Benefits of x87 over SSE

Tags: x86, x86-64, sse, x87, fpu

I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other benefit to using x87? I have a habit of typing -mfpmath=sse automatically in any project, and I wonder if I'm missing anything else that the x87 FPU offers.

431

asked Dec 04 '09 03:12

Tom

2 Answers

For hand-written asm, x87 has some instructions that don't exist in the SSE instruction set.

Off the top of my head, it's all the trigonometric stuff like fsin, fcos, fsincos, fptan and fpatan (the two-argument arctangent), plus some exponential/logarithm instructions such as f2xm1, fyl2x and fyl2xp1.
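As a rough illustration (my own sketch, not taken from any library), this is one way to reach fsin from C using GNU-style inline asm. The helper name x87_sin is made up for the example, and the series fallback is only there so the snippet still builds on non-x86 targets; the "t" constraint is GCC's name for the top of the x87 register stack.

```c
/* Sketch: call the x87 fsin instruction from C via GNU inline asm.
   x87_sin is a hypothetical helper name, not a library function. */
double x87_sin(double x)
{
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    double r;
    /* "=t": result comes back in st(0); "0": input goes in the same register. */
    __asm__ ("fsin" : "=t"(r) : "0"(x));
    return r;
#else
    /* Assumed fallback so the sketch builds elsewhere: a plain Taylor
       series, only reasonable for small |x|. */
    double term = x, sum = x;
    for (int n = 1; n <= 12; ++n) {
        term *= -(x * x) / ((2.0 * n) * (2.0 * n + 1.0));
        sum += term;
    }
    return sum;
#endif
}
```

Note that fsin only accepts inputs with magnitude below 2^63; outside that range it leaves the operand unchanged and sets the C2 flag, which this sketch does not check.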

With gcc -O3 -ffast-math -mfpmath=387, GCC 9 will still inline sin(x) as an fsin instruction, regardless of what the implementation in libm would have used (https://godbolt.org/z/Euc5gp).

MSVC calls __libm_sse2_sin_precise when compiling for 32-bit x86.

If your code spends most of the time doing trigonometry, you may see a slight performance gain or loss if you use x87, depending on whether your standard math-library implementation using SSE1/SSE2 is faster or slower than the slow microcode for fsin on whatever CPU you're using.

CPU vendors don't put a lot of effort into optimizing the microcode for x87 instructions in the newest generations of CPUs because x87 is generally considered obsolete and rarely used. (Look at the uop counts and throughput for complex x87 instructions in Agner Fog's instruction tables: recent CPUs take more cycles than older ones.) The newer the CPU, the more likely a single x87 instruction will be slower than a sequence of many SSE or AVX instructions that computes log, exp, pow, or trig functions.

Even when x87 is available, not all math libraries choose to use complex instructions like fsin for implementing functions like sin(), or especially exp/log where integer tricks for manipulating the log-based FP bit-patterns are useful.
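To give an idea of what those integer tricks look like, here is a minimal sketch, assuming IEEE 754 binary64 doubles and a positive, normal, finite input. The function names are hypothetical; a real log() would go on to evaluate a polynomial on the significand and add the exponent times ln 2.

```c
#include <stdint.h>
#include <string.h>

/* Unbiased binary exponent of a positive, normal, finite double,
   i.e. floor(log2(x)), read straight from the bit pattern. */
int exponent_of(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);            /* well-defined type pun */
    return (int)((bits >> 52) & 0x7FF) - 1023; /* remove the exponent bias */
}

/* Force the exponent field to the bias (1023), which yields the
   significand m in [1, 2) such that x == m * 2^exponent_of(x). */
double significand_of(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = (bits & ~(0x7FFULL << 52)) | (1023ULL << 52);
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```

For example, 10.0 splits into exponent 3 and significand 1.25, since 1.25 * 2^3 == 10. None of this needs the x87 unit, and it works equally well on values held in SSE registers.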

Some DSP algorithms use a lot of trig, but typically benefit a lot from auto-vectorization with SIMD math libraries.

However, for math code that spends most of its time doing additions, multiplications and the like, SSE is usually faster.
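The kind of loop I have in mind looks something like the sketch below (axpy is just an illustrative name). With SSE2, two doubles fit in one register, so a compiler has the option of processing the array two elements at a time; the x87 stack only ever works on one value per instruction.

```c
#include <stddef.h>

/* y[i] += a * x[i] for n elements: nothing but multiplies and adds.
   restrict tells the compiler that x and y do not overlap. */
void axpy(double a, const double *restrict x, double *restrict y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

If you want to see the difference on your own toolchain, try building a 32-bit version with something like gcc -O3 -m32 -msse2 -mfpmath=sse and again with -mfpmath=387, then compare the generated asm and timings. What you actually get depends on the compiler version and target CPU.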

Also related: "Intel Underestimates Error Bounds by 1.3 quintillion". The worst case for fsin (catastrophic cancellation for inputs very near pi) is very bad. Software can do better, but only with slow extended-precision techniques.

100

answered Sep 28 '22 18:09

Nils Pipenbrinck

It's present on really old machines.

EOF

answered Sep 28 '22 18:09

Simeon Pilgrim

