Why is the standard "abs" function faster than mine?

Tags:

c++

compiler-optimization

gcc

g++

I wanted to try making my own absolute value function. I figured that the fastest way to calculate absolute value would be to simply mask out the sign bit (the most significant bit in IEEE 754). I wanted to compare its speed to the standard abs function. Here is my implementation:

// Union used for type punning
union float_uint_u
{
    float f_val;
    unsigned int ui_val;
};

// 'MASK' has all bits == 1 except the most significant (sign) bit
constexpr unsigned int MASK = ~(1u << (sizeof(unsigned int) * 8 - 1));

float abs_bitwise(float value)
{
    float_uint_u ret;
    ret.f_val = value;
    ret.ui_val &= MASK;
       
    return ret.f_val;
}

For the record, I know that this sort of type punning is not standard C++. However, this is just for educational purposes, and according to the docs, this is supported in GCC.
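For comparison, the same masking idea can be written without the union by copying the bytes with std::memcpy, which is well-defined in standard C++. This is only a minimal sketch: the name abs_memcpy is made up for illustration, and it assumes a 32-bit IEEE 754 float. It addresses the portability concern only and is not claimed to change the code GCC generates. In C++20, std::bit_cast from <bit> is another option.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Hypothetical name: same masking idea as abs_bitwise, without the union.
inline float abs_memcpy(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // copy the object representation
    bits &= 0x7FFFFFFFu;                       // clear the sign bit (bit 31)
    std::memcpy(&value, &bits, sizeof value);  // copy the masked bits back
    return value;
}
```

Compilers typically optimize the two memcpy calls away, so the function should boil down to the same AND as the union version.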

I figured this should be the fastest way to calculate absolute value, so it should at the very least be as fast as the standard implementation. However, timing 100000000 iterations of random values, I got the following results:

Bitwise time: 5.47385 | STL time: 5.15662
Ratio: 1.06152

My abs function is about 6% slower.
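The timing code itself is not shown here, so the following is only a rough sketch of how such a comparison could be set up. The helpers make_inputs and time_abs are hypothetical names, and the value range and seed are arbitrary assumptions.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: fills a vector with random floats in [-1000, 1000).
inline std::vector<float> make_inputs(std::size_t count, unsigned seed = 42)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(-1000.0f, 1000.0f);
    std::vector<float> values(count);
    for (float& v : values) v = dist(gen);
    return values;
}

// Hypothetical helper: applies `f` to every input and returns
// {elapsed seconds, checksum of the results}.
template <typename F>
std::pair<double, double> time_abs(F f, const std::vector<float>& inputs)
{
    const auto start = std::chrono::steady_clock::now();
    double checksum = 0.0;
    for (float v : inputs) checksum += f(v);
    const auto stop = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(stop - start).count(), checksum};
}
```

Calling time_abs once with abs_bitwise and once with a lambda wrapping std::abs on the same inputs gives two times to compare. Printing or otherwise using the checksum keeps the optimizer from discarding the loop, and matching checksums are a quick sanity check that both functions agree.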

Assembly output

I compiled with -O2 optimization and the -S option (assembly output) to help determine what was going on. I have extracted the relevant portions:

; Standard abs. 16(%rsp) is a value obtained from standard input
movss   16(%rsp), %xmm0
andps   .LC5(%rip), %xmm0 ; .LC5 == 2147483647
movq    %rbp, %rdi
cvtss2sd    %xmm0, %xmm0

; abs_bitwise (my implementation)
movl    16(%rsp), %eax
movq    %rbp, %rdi
andl    $2147483647, %eax
movd    %eax, %xmm0
cvtss2sd    %xmm0, %xmm0

Observations

I'm not great at assembly, but the main thing I noticed is that the standard function operates directly on the xmm0 register. But with mine, it first moves the value to eax (for some reason), performs the and, and then moves it into xmm0. I'm assuming the extra mov is where the slowdown happens. I also noticed that the standard version loads the bit mask from elsewhere in the program, whereas mine uses an immediate. I'm guessing that's not significant, however. The two versions also use different instructions (e.g. movl vs movss).

System info

This was compiled with g++ on Debian Linux (unstable branch). g++ --version output:

g++ (Debian 10.2.1-6) 10.2.1 20210110

If these two versions of the code both calculate absolute value the same way (via an and), why doesn't the optimizer generate the same code? Specifically, why does it feel the need to include an extra mov when it optimizes my implementation?

549

asked Feb 03 '21 08:02

Lysol

1 Answer

I got slightly different assembly. According to the x86-64 Linux ABI, a float argument is passed via xmm0. With the standard fabs, the bitwise AND operation is performed directly on this register (Intel syntax):

andps xmm0, XMMWORD PTR .LC0[rip] # .LC0 contains 0x7FFFFFFF
ret

However, in your case, the bitwise AND is performed on objects of type unsigned int. GCC therefore does the same in an integer register, which requires moving the value from xmm0 to eax first:

movd eax, xmm0
and  eax, 2147483647
movd xmm0, eax
ret

Live demo: https://godbolt.org/z/xj8MMo

I haven't found any way to force the GCC optimizer to perform the AND directly on xmm0 using only pure C/C++ source code. It seems that efficient implementations need to be built on assembly code or Intel intrinsics.
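As an illustration of the intrinsics route, here is a minimal sketch using SSE intrinsics from <xmmintrin.h>. The function name abs_sse and the macro ABS_SSE_AVAILABLE are made up for this example, and the std::fabs fallback for non-SSE targets is an added assumption. Whether a given compiler version turns this into a single andps should be checked in the generated assembly rather than taken for granted.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define ABS_SSE_AVAILABLE 1  // hypothetical macro, for this sketch only
#endif

// Hypothetical name: clears the sign bit while staying in the SSE domain.
inline float abs_sse(float value)
{
#ifdef ABS_SSE_AVAILABLE
    const __m128 sign_bit = _mm_set_ss(-0.0f);  // only bit 31 of lane 0 is set
    const __m128 v = _mm_set_ss(value);         // value in lane 0, zeros elsewhere
    // _mm_andnot_ps(a, b) computes (~a) & b, i.e. everything except the sign bit.
    return _mm_cvtss_f32(_mm_andnot_ps(sign_bit, v));
#else
    return std::fabs(value);                    // fallback on targets without SSE
#endif
}
```

Using -0.0f as the mask avoids spelling out the integer constant: its bit pattern is exactly the sign bit, so and-not with it clears that bit and nothing else.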

Relevant question: How to perform a bitwise operation on floating point numbers. All the proposed solutions basically result in the same outcome.

I also tried to use the copysign function, but the result was even worse: the generated machine code then contained x87 instructions.
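The exact copysign code that was tried is not shown, so the following is only a guess at what such a variant might look like. The name abs_copysign is hypothetical.

```cpp
#include <cmath>

// Hypothetical name: takes the magnitude of `value` and the sign of +1.0f.
inline float abs_copysign(float value)
{
    return std::copysign(value, 1.0f);
}
```

Semantically this matches fabs, including for negative zero, since the result always carries a positive sign.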

Anyway, it is quite interesting that the Clang optimizer was clever enough to make the assembly in all 3 cases equivalent: https://godbolt.org/z/b6Khv5.

178

answered Sep 21 '22 12:09

Daniel Langr
