Double values store higher precision and are double the size of a float, but are Intel CPUs optimized for floats? That is, are double operations just as fast or faster than float operations for +, -, *, and /? Does the answer change for 64-bit architectures?


Is using double faster than float?

4 Answers

There isn't a single "Intel CPU", especially when it comes to which operations are optimized relative to others, but for most of them, at the CPU level (specifically within the FPU), the answer to your question:

are double operations just as fast or faster than float operations for +, -, *, and /?

is "yes" -- within the CPU, except for division and sqrt, which are somewhat slower for double than for float. (This assumes your compiler uses SSE2 for scalar FP math, as all x86-64 compilers do, and as some 32-bit compilers do depending on options. Legacy x87 doesn't have different widths in registers, only in memory (it converts on load/store), so historically even sqrt and division took the same time for float and double.)

For example, Haswell has a divsd (scalar double) throughput of one per 8 to 14 cycles (data-dependent), but a divss (scalar single) throughput of one per 7 cycles. x87 fdiv has a throughput of one per 8 to 18 cycles. (Numbers from https://agner.org/optimize/. Latency correlates with throughput for division, but is higher than the throughput numbers.)

The float versions of many library functions, like logf(float) and sinf(float), will also be faster than log(double) and sin(double), because they have far fewer bits of precision to get right. They can use polynomial approximations with fewer terms to reach full precision for float than for double.
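As a rough illustration of the "fewer terms" point, here is a minimal sketch that counts how many Taylor-series terms of sin(x) are needed before the next term falls below float epsilon versus double epsilon. The function name taylor_sin_terms is made up for this example, and real math libraries typically use range reduction plus tuned (e.g. minimax) polynomials rather than a plain Taylor series, so treat the counts as indicative only.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not from any library): counts the Taylor-series terms
// of sin(x) that are at least `tolerance` in magnitude, i.e. the terms a
// naive evaluation would have to keep to reach that accuracy.
int taylor_sin_terms(double x, double tolerance) {
    assert(tolerance > 0.0);  // a zero tolerance would never terminate
    double term = x;          // first term: x^1 / 1!
    int n = 1;                // exponent of the current term
    int count = 0;
    while (std::fabs(term) >= tolerance) {
        ++count;
        // Next term: multiply by -x^2 / ((n + 1)(n + 2)).
        term *= -x * x / ((n + 1) * (n + 2));
        n += 2;
    }
    return count;
}

// Terms needed at x = pi/4 for each target precision.
int terms_for_float() {
    return taylor_sin_terms(0.7853981633974483,
                            std::numeric_limits<float>::epsilon());
}

int terms_for_double() {
    return taylor_sin_terms(0.7853981633974483,
                            std::numeric_limits<double>::epsilon());
}
```

Comparing terms_for_float() with terms_for_double() shows the float target being reached with noticeably fewer terms, which is the effect described above.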

However, taking up twice the memory for each number clearly implies heavier load on the cache(s) and more memory bandwidth to fill and spill those cache lines from/to RAM; the time you care about performance of a floating-point operation is when you're doing a lot of such operations, so the memory and cache considerations are crucial.
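To put numbers on the memory argument, the following sketch computes how many values fit in one cache line and how large an array is for each type. The 64-byte line size is an assumption (it is common on current x86 parts, but check your own CPU), and the helper names are invented for this example.

```cpp
#include <cstddef>

// Assumption: 64-byte cache lines, which is typical for x86 but not universal.
constexpr std::size_t kCacheLineBytes = 64;

// Number of elements of type T that fit in a single cache line.
template <typename T>
constexpr std::size_t elements_per_cache_line() {
    return kCacheLineBytes / sizeof(T);
}

// Total bytes occupied by `count` contiguous elements of type T.
template <typename T>
constexpr std::size_t array_bytes(std::size_t count) {
    return count * sizeof(T);
}
```

With 4-byte float and 8-byte double, each cache line fetched from RAM delivers twice as many float values, so a loop that streams through a large array touches half as many lines.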

@Richard's answer points out that there are also other ways to perform FP operations (the SSE / SSE2 instructions; good old MMX was integers-only), especially suitable for simple ops on lots of data ("SIMD", single instruction / multiple data), where each 128-bit vector register can pack 4 single-precision floats or only 2 double-precision ones, so this effect will be even more marked.

In the end, you do have to benchmark, but my prediction is that for reasonable (i.e., large;-) benchmarks, you'll find advantage to sticking with single precision (assuming of course that you don't need the extra bits of precision!-).
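If you want a starting point for such a benchmark, something along these lines may do. It is only a sketch: the names BenchResult and time_scaled_sum are made up here, the workload (a scaled sum over a vector) is an arbitrary choice, and a serious measurement would need optimization flags, warm-up runs, and repeated trials.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Result of one timed run. The sum is returned so the compiler cannot
// discard the loop as dead code.
template <typename T>
struct BenchResult {
    T sum;
    double seconds;
};

// Hypothetical micro-benchmark: sums n elements, each multiplied by `scale`,
// and reports the elapsed wall-clock time for the loop only.
template <typename T>
BenchResult<T> time_scaled_sum(std::size_t n, T scale) {
    assert(n > 0);
    std::vector<T> data(n, T(1));  // allocation happens outside the timed region

    auto start = std::chrono::steady_clock::now();
    T sum = T(0);
    for (T v : data) {
        sum += v * scale;
    }
    auto stop = std::chrono::steady_clock::now();

    return {sum, std::chrono::duration<double>(stop - start).count()};
}
```

Calling time_scaled_sum<float>(n, 2.0f) and time_scaled_sum<double>(n, 2.0) with an n large enough to exceed your caches lets you compare the two on your own hardware.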

196

answered Sep 24 '22 18:09

Alex Martelli

If all floating-point calculations are performed within the x87 FPU, then, no, there is no difference between a double calculation and a float calculation, because the floating-point operations are actually performed with 80 bits of precision in the FPU stack. Entries of the FPU stack are rounded as appropriate to convert the 80-bit floating-point format to the double or float floating-point format. Moving sizeof(double) bytes to/from RAM versus sizeof(float) bytes is the only difference in speed.

If, however, you have a vectorizable computation, then you can use the SSE extensions to run four float calculations in the same time as two double calculations. Therefore, clever use of the SSE instructions and the XMM registers can allow higher throughput on calculations that only use floats.
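For a concrete picture, here is a small sketch using the SSE/SSE2 intrinsics from emmintrin.h. One packed add handles four floats, while the double version handles only two in the same 128-bit XMM register. The wrapper functions add4_floats and add2_doubles are invented for this example, and it assumes an x86 target with SSE2 available (the baseline on x86-64).

```cpp
#include <array>
#include <emmintrin.h>  // SSE2 intrinsics; also provides the SSE float ones

// Adds four floats with one packed single-precision add (addps).
std::array<float, 4> add4_floats(const std::array<float, 4>& a,
                                 const std::array<float, 4>& b) {
    __m128 va = _mm_loadu_ps(a.data());  // unaligned load of 4 floats
    __m128 vb = _mm_loadu_ps(b.data());
    std::array<float, 4> out{};
    _mm_storeu_ps(out.data(), _mm_add_ps(va, vb));
    return out;
}

// Adds two doubles with one packed double-precision add (addpd).
std::array<double, 2> add2_doubles(const std::array<double, 2>& a,
                                   const std::array<double, 2>& b) {
    __m128d va = _mm_loadu_pd(a.data());  // unaligned load of 2 doubles
    __m128d vb = _mm_loadu_pd(b.data());
    std::array<double, 2> out{};
    _mm_storeu_pd(out.data(), _mm_add_pd(va, vb));
    return out;
}
```

In practice an optimizing compiler will often auto-vectorize simple loops like this for you, so hand-written intrinsics are mainly useful when the compiler fails to do so.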

answered Sep 20 '22 18:09

Daniel Trebbien

Another point to consider is whether you are using a GPU (the graphics card). I work on a project that is numerically intensive, yet we do not need the precision that double offers. We use GPU cards to further speed up the processing. CUDA GPUs need a special package to support double, and the local RAM on a GPU is quite fast but quite scarce. As a result, using float also doubles the amount of data we can store on the GPU.

Yet another point is memory. Floats take half as much RAM as doubles. If you are dealing with VERY large datasets, this can be a really important factor. If using double means you have to spill to disk instead of staying purely in RAM, the difference will be huge.

So for the application I am working with, the difference is quite important.

answered Sep 23 '22 18:09

Miley

I just want to add to the already existing great answers that the _mm256 family of single-instruction-multiple-data (SIMD) C++ intrinsic functions, which work on the __m256 vector types, operate on either 4 doubles in parallel (e.g. _mm256_add_pd) or 8 floats in parallel (e.g. _mm256_add_ps).

I'm not sure if this can translate to an actual speed up, but it seems possible to process 2x as many floats per instruction when SIMD is used.

answered Sep 24 '22 18:09

bobobobo


Tags: c++, performance, x86, intel, osx-snow-leopard

Brent Faust
