I have run into a curious problem. An algorithm I am working on consists of lots of computations like this:
q = x(0)*y(0)*z(0) + x(1)*y(1)*z(1) + ...
where the length of the summation is between 4 and 7.
The original computations are all done in 64-bit precision. For experimentation, I tried using 32-bit precision for the x, y, z input values (so that the computations are performed in 32-bit) and storing the final result as a 64-bit value (a straightforward cast).
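In outline, the two variants have this shape (the function names, the array/pointer interface, and the loop form here are illustrative, not the actual code):

// Hypothetical sketch of the kernel described above.
double q_double(const double* x, const double* y, const double* z, int n) {
    double q = 0.0;
    for (int i = 0; i < n; ++i)          // n is between 4 and 7
        q += x[i] * y[i] * z[i];
    return q;
}

// 32-bit experiment: inputs and arithmetic in float, result stored as double.
double q_float(const float* x, const float* y, const float* z, int n) {
    float q = 0.0f;
    for (int i = 0; i < n; ++i)
        q += x[i] * y[i] * z[i];
    return static_cast<double>(q);       // the "straightforward cast" at the end
}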
I expected the 32-bit version to perform better (smaller cache footprint, more elements per SIMD register, etc.), but to my surprise there was no difference in performance, perhaps even a slight decrease.
The platform in question is Intel 64, Linux, and GCC. Both versions do seem to use SSE, and the arrays in both cases are aligned to a 16-byte boundary.
Why would this be? My guess so far is that 32-bit precision can use SSE only on the first four elements, with the rest being done serially, compounded by the cast overhead.
Floating Point Numbers
Floats generally come in two flavours: “single” and “double” precision. Single-precision floats are 32 bits long, while doubles are 64 bits. Due to their finite size, floats cannot represent all of the real numbers: there are limitations on both their precision and their range.
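For example, 16777217 (which is 2^24 + 1) needs 25 significant bits, so it cannot be stored exactly in a single-precision float, while a double holds it exactly. A small illustrative snippet (my addition, not from the original post):

#include <cstdio>

int main() {
    float  f = 16777217.0f;  // 2^24 + 1 rounds to 16777216.0f in single precision
    double d = 16777217.0;   // representable exactly in double precision
    std::printf("float:  %.1f\n", f);   // prints 16777216.0
    std::printf("double: %.1f\n", d);   // prints 16777217.0
}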
The processor word size (32 or 64 bits) is largely independent of floating-point speed. Generally, 64-bit processors are newer than those supporting only 32 bits, so they tend to be faster simply because they are newer designs, not because of the word size itself.
Simply put, a 64-bit processor is more capable than a 32-bit processor because it can handle more data at once. A 64-bit processor works with wider values, including memory addresses, which means it can address over 4 billion times as much memory as a 32-bit processor (a 2^64-byte address space versus 2^32 bytes).
On x87 at least, everything is really done in 80-bit precision internally; the declared precision just determines how many of those bits are stored back to memory. This is part of the reason why different optimization settings can change results slightly: they change the amount of rounding from 80 bits down to 32 or 64 bits.
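One way to see which evaluation width a particular GCC build uses is to inspect the standard <cfloat> macros. The probe below is only a sketch I am adding, and the values it prints depend on compiler flags such as -mfpmath=387 versus -mfpmath=sse:

#include <cfloat>
#include <cstdio>

int main() {
    std::printf("float mantissa bits:       %d\n", FLT_MANT_DIG);   // typically 24
    std::printf("double mantissa bits:      %d\n", DBL_MANT_DIG);   // typically 53
    std::printf("long double mantissa bits: %d\n", LDBL_MANT_DIG);  // 64 with the x87 80-bit format
    std::printf("FLT_EVAL_METHOD:           %d\n", (int)FLT_EVAL_METHOD);
    // 2 => float/double expressions are evaluated in long double (x87) precision
    // 0 => each expression is evaluated in its own declared precision (SSE math)
}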
In practice, using 80-bit floating point (long double in C and C++, real in D) is usually slow because there's no efficient way to load and store 80 bits from memory. 32-bit and 64-bit are usually equally fast provided that memory bandwidth isn't the bottleneck, i.e. if everything is in cache anyhow. 64-bit can be slower if either of the following happens:
As far as SIMD optimizations go, it should be noted that most compilers are horrible at auto-vectorizing code. If you don't want to write directly in assembly language, the best way to take advantage of these instructions is to use things like array-wise operations, which are available, for example, in D and are implemented in terms of SSE instructions. Similarly, in C or C++, you would probably want to use a high-level library of SSE-optimized functions, though I don't know of a good one off the top of my head because I mostly program in D.
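As a hedged illustration rather than a recommendation, the 4-element case from the question could also be written directly with SSE intrinsics in C++; the function name and the assumption of 16-byte-aligned float inputs are mine:

#include <xmmintrin.h>  // SSE1 intrinsics, available in GCC on x86-64

// Sketch: sum of x[i]*y[i]*z[i] for i = 0..3, one SSE register per array.
// Assumes the pointers are 16-byte aligned, as stated in the question.
float triple_dot4_sse(const float* x, const float* y, const float* z) {
    __m128 p = _mm_mul_ps(_mm_mul_ps(_mm_load_ps(x), _mm_load_ps(y)),
                          _mm_load_ps(z));                    // 4 products in parallel
    // Horizontal sum of the 4 lanes using only SSE1 shuffles and adds.
    __m128 shuf = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1)); // swap adjacent pairs
    __m128 sums = _mm_add_ps(p, shuf);                        // (p0+p1, ., p2+p3, .)
    shuf = _mm_movehl_ps(shuf, sums);                         // bring p2+p3 down to lane 0
    sums = _mm_add_ss(sums, shuf);                            // (p0+p1)+(p2+p3) in lane 0
    return _mm_cvtss_f32(sums);
}

For lengths of 5 to 7 the remaining elements would still need a scalar tail, which lines up with the asker's guess above, and a double version of the same four elements needs two __m128d registers, so the per-register advantage of float over double is easily lost in overhead for sums this short.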