How would fabs(double) be implemented on x86? Is it an expensive operation?

High-level programming languages often provide a function to determine the absolute value of a floating-point number. For example, the C standard library provides the fabs(double) function.

How is this library function actually implemented for x86 targets? What would actually be happening "under the hood" when I call a high-level function like this?

Is it an expensive operation (a combination of multiplication and taking the square root)? Or is the result found just by removing a negative sign in memory?

asked Jun 19 '17 by AlexG

People also ask

How does the fabs function work?

The fabs() function takes a single argument (a double) and returns the absolute value of that number (also as a double). To find the absolute value of an integer or a float, you can explicitly convert the number to double.
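
For instance, a minimal C program along these lines illustrates that behavior (only the standard library is assumed):

#include <math.h>
#include <stdio.h>

int main(void)
{
    int    n = -42;
    double x = -3.5;

    /* The int is explicitly converted to double before calling fabs(). */
    printf("%f\n", fabs((double)n));  /* prints 42.000000 */
    printf("%f\n", fabs(x));          /* prints 3.500000  */
    return 0;
}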

What does fabs mean in coding?

In the C programming language, the fabs function returns the absolute value of a floating-point number.

What is the use of fabs function in C++?

The fabs() function in C++ returns the absolute value of the argument. It is defined in the cmath header file. Mathematically, fabs(num) = |num| .

What is fabs in Arduino?

fabs() is the floating-point absolute-value function. It returns a positive value, regardless of whether the input was positive or negative. In the code being discussed, it is used to determine, when one of the values is exactly 0, whether the other one is small enough to be considered almost equal to 0.
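
A rough sketch of that kind of check in C (the helper name almostEqual and the EPSILON tolerance are illustrative assumptions, not code from any particular library):

#include <math.h>
#include <stdbool.h>

#define EPSILON 1e-9  /* assumed tolerance; choose one appropriate to your data */

/* If either value is exactly 0, test whether the other is close enough to 0;
   otherwise compare the difference against a tolerance scaled by magnitude. */
static bool almostEqual(double a, double b)
{
    if (a == 0.0 || b == 0.0)
        return fabs(a - b) < EPSILON;
    return fabs(a - b) < EPSILON * fmax(fabs(a), fabs(b));
}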


1 Answer

In general, computing the absolute-value of a floating-point quantity is an extremely cheap and fast operation.

In practically all cases, you can simply treat the fabs function from the standard library as a black box, sprinkling it in your algorithms where necessary, without any need to worry about how it will affect execution speed.

If you want to understand why this is such a cheap operation, then you need to know a little bit about how floating-point values are represented. Although the C and C++ language standards do not actually mandate it, most implementations follow the IEEE-754 standard. In that standard, each floating-point value's representation contains a bit known as the sign bit, and this marks whether the value is positive or negative. For example, consider a double, which is a 64-bit double-precision floating-point value:

Bit-level representation of a double-precision floating-point value
     (Image courtesy of Codekaizen, via Wikipedia, licensed under CC BY-SA.)

You can see the sign bit over there on the far left, in light blue. This is true for all precisions of floating-point values in IEEE-754. Therefore, taking the absolute value basically just amounts to clearing a single bit in the value's representation in memory. In particular, you just need to mask off the sign bit with a bitwise-AND, forcing it to 0 and thereby making the value non-negative.
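
To make the bit manipulation concrete, here is a minimal sketch in C (assuming an IEEE-754 double and a 64-bit integer type; this illustrates the idea, and is not necessarily how any particular standard library implements fabs). The name my_fabs is just an illustrative choice:

#include <stdint.h>
#include <string.h>

/* Illustrative only: clear bit 63 (the sign bit) of an IEEE-754 double. */
double my_fabs(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the double's bytes */
    bits &= 0x7FFFFFFFFFFFFFFFULL;    /* force the sign bit to 0        */
    memcpy(&x, &bits, sizeof x);
    return x;
}

With optimizations enabled, mainstream compilers typically recognize this pattern and reduce it to the same single masking instruction described below.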

Assuming that your target architecture has hardware support for floating-point operations, this is generally a single, one-cycle instruction—basically, as fast as can possibly be. An optimizing compiler will inline a call to the fabs library function, emitting that single hardware instruction in its place.
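
For example, a trivial wrapper around the library call shows this (the typical x86-64 output of an optimizing compiler is noted in the comment as a rough guide; the exact assembly depends on your compiler and flags, and abs_of is just an illustrative name):

#include <math.h>

double abs_of(double x)
{
    /* With optimizations enabled (e.g. -O2), a typical x86-64 compiler
       inlines this call and emits a single bitwise-AND instruction,
       something like:  andpd xmm0, [sign_mask_constant]
       rather than an actual call into the fabs library routine. */
    return fabs(x);
}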

If your target architecture doesn't have hardware support for floating-point (which is pretty rare nowadays), then there will be a library that emulates these semantics in software, thus providing floating-point support. Typically, floating-point emulation is slow, but finding the absolute value is one of the fastest things you can do, since it is literally just manipulating a bit. You'll pay the overhead of a function call to fabs, but at worst, the implementation of that function will just involve reading the bytes from memory, masking off the sign bit, and storing the result back to memory.

Looking specifically at x86, which does implement IEEE-754 in hardware, there are two main ways that your C compiler will transform a call to fabs into machine code.

In 32-bit builds, where the legacy x87 FPU is being used for floating-point operations, it will emit an fabs instruction. (Yep, same name as the C function.) This clears the sign bit of the floating-point value at the top of the x87 register stack. On AMD processors and the Intel Pentium 4, fabs is a 1-cycle instruction with a 2-cycle latency. On AMD Ryzen and all other Intel processors, it is a 1-cycle instruction with a 1-cycle latency.

In 32-bit builds that can assume SSE support, and in all 64-bit builds (where SSE is always supported), the compiler will emit an ANDPS instruction* that does exactly what I described above: it bitwise-ANDs the floating-point value with a constant mask, masking out the sign bit. Notice that SSE/SSE2 doesn't have a dedicated instruction for taking the absolute value like x87 does, but it doesn't need one, because the general-purpose bitwise instructions do the job just fine. The execution characteristics (throughput, latency, and so on) vary a bit more widely from one processor microarchitecture to another, but it generally has a throughput of 1–3 cycles, with similar latency. If you like, you can look it up in Agner Fog's instruction tables for the processors of interest.
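
If you want to express the same masking directly in source code, a hedged sketch using SSE2 intrinsics looks like this (one of several possible formulations; the function name abs_pd is just an illustrative choice):

#include <immintrin.h>

/* Clear the sign bit of each packed double: ANDN with -0.0 (which has only
   the sign bit set) keeps every bit of v except the sign bit. */
static __m128d abs_pd(__m128d v)
{
    return _mm_andnot_pd(_mm_set1_pd(-0.0), v);
}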

If you're really interested in digging into this, you might see this answer (hat tip to Peter Cordes), which explores a variety of different ways to implement an absolute-value function using SSE instructions, comparing their performance and discussing how you could get a compiler to generate the appropriate code. As you can see, since you're just manipulating bits, there are a variety of possible solutions! In practice, though, the current crop of compilers do exactly as I've described for the C library function fabs, which makes sense, because this is the best general-purpose solution.

__
* Technically, this might also be ANDPD, where the D means "double" (and the S means "single"), but ANDPD requires SSE2 support. SSE supports single-precision floating-point operations, and was available all the way back to the Pentium III. SSE2 is required for double-precision floating-point operations, and was introduced with the Pentium 4. SSE2 is always supported on x86-64 CPUs. Whether ANDPS or ANDPD is used is a decision made by the compiler's optimizer; sometimes you will see ANDPS being used on a double-precision floating-point value, since it just requires writing the mask the right way.
Also, on CPUs that support AVX instructions, you'll generally see a VEX-prefix on the ANDPS/ANDPD instruction, so that it becomes VANDPS/VANDPD. Details on how this works and what its purpose is can be found elsewhere online; suffice it to say that mixing VEX and non-VEX instructions can result in a performance penalty, so compilers try to avoid it. Again, though, both of these versions have the same effect and virtually identical execution speeds.

Oh, and because SSE is a SIMD instruction set, it is possible to compute the absolute value of multiple floating-point values at once. This, as you might imagine, is especially efficient. Compilers with auto-vectorization capabilities will generate code like this where possible. Example (mask can either be generated on-the-fly, as shown here, or loaded as a constant):

pcmpeqd xmm1, xmm1    ; generate the mask (all 1s) in a temporary register
psrld   xmm1, 1       ; put 1s in all but the left-most bit of each packed dword
andps   xmm0, xmm1    ; mask off the sign bit in each packed floating-point value
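
As a hedged sketch of the kind of source code that lends itself to this, an auto-vectorizing compiler can typically turn a simple loop like the following into packed ANDPS/ANDPD operations (the function name abs_array is illustrative; the exact output depends on the compiler, flags, and target):

#include <math.h>
#include <stddef.h>

/* Take the absolute value of every element. With optimizations and
   vectorization enabled, compilers typically process several elements
   per iteration using packed bitwise-AND instructions. */
void abs_array(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = fabs(src[i]);
}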

answered Sep 19 '22 by Cody Gray