 

Why does an inline function have lower efficiency than an in-built function?

Tags:

c++

arrays

I was trying a question on arrays on InterviewBit. In this question I wrote an inline function returning the absolute value of an integer, but on submitting it I was told that my algorithm was not efficient. When I switched to abs() from the C++ library, the same solution got a correct-answer verdict.

Here is the version that got the "inefficient" verdict -

inline int abs(int x){return x>0 ? x : -x;}

int Solution::coverPoints(vector<int> &X, vector<int> &Y) {
    int l = X.size();
    int i = 0;
    int ans = 0;
    while (i<l-1){
        ans = ans + max(abs(X[i]-X[i+1]), abs(Y[i]-Y[i+1]));
        i++;
    }
    return ans;
}

Here's the one that got the correct answer -

int Solution::coverPoints(vector<int> &X, vector<int> &Y) {
    int l = X.size();
    int i = 0;
    int ans = 0;
    while (i<l-1){
        ans = ans + max(abs(X[i]-X[i+1]), abs(Y[i]-Y[i+1]));
        i++;
    }
    return ans;
}

Why did this happen? I thought inline functions were fastest, since no call is made. Or is the site in error? And if the site is correct, what does the C++ abs() use that is faster than my inline abs()?

asked Jul 09 '17 by monster



2 Answers

I don't agree with their verdict. They are clearly wrong.

With current optimizing compilers, both solutions produce the exact same output. And even if they didn't produce exactly the same code, the result would be as efficient as the library version (it may be a little surprising that everything matches: the algorithm, even the registers used. Perhaps the actual library implementation is the same as the OP's?).

No sane optimizing compiler will create a branch in your abs() code (when it can be done without a branch), as the other answer suggests. And if the compiler is not optimizing, then it may not inline the library abs() either, so that won't be fast.

Optimizing abs() is one of the easiest things for a compiler to do (just add an entry for it in the peephole optimizer, and done).

Furthermore, I've seen library implementations in the past where abs() was a non-inline library function (that was a long time ago, though).

Proof that both implementations are the same:

GCC:

myabs:
    mov     edx, edi    ; argument passed in EDI by System V AMD64 calling convention
    mov     eax, edi
    sar     edx, 31
    xor     eax, edx
    sub     eax, edx
    ret

libabs:
    mov     edx, edi    ; argument passed in EDI by System V AMD64 calling convention
    mov     eax, edi
    sar     edx, 31
    xor     eax, edx
    sub     eax, edx
    ret

Clang:

myabs:
    mov     eax, edi    ; argument passed in EDI by System V AMD64 calling convention
    neg     eax
    cmovl   eax, edi
    ret

libabs:
    mov     eax, edi    ; argument passed in EDI by System V AMD64 calling convention
    neg     eax
    cmovl   eax, edi
    ret

Visual Studio (MSVC):

libabs:
    mov      eax, ecx    ; argument passed in ECX by Windows 64-bit calling convention 
    cdq
    xor      eax, edx
    sub      eax, edx
    ret      0

myabs:
    mov      eax, ecx    ; argument passed in ECX by Windows 64-bit calling convention 
    cdq
    xor      eax, edx
    sub      eax, edx
    ret      0

ICC:

myabs:
    mov       eax, edi    ; argument passed in EDI by System V AMD64 calling convention 
    cdq
    xor       edi, edx
    sub       edi, edx
    mov       eax, edi
    ret      

libabs:
    mov       eax, edi    ; argument passed in EDI by System V AMD64 calling convention 
    cdq
    xor       edi, edx
    sub       edi, edx
    mov       eax, edi
    ret      

See for yourself on Godbolt Compiler Explorer, where you can inspect the machine code generated by various compilers. (Link kindly provided by Peter Cordes.)

answered Oct 24 '22 by geza


Your abs performs branching based on a condition, while the built-in variant is compiled to branch-free code that manipulates the sign of the integer directly, most likely using just a couple of instructions. A possible assembly example (taken from here):

cdq
xor eax, edx
sub eax, edx

The cdq instruction copies the sign of the register eax into register edx. For example, if eax holds a non-negative number, edx will be zero; otherwise edx will be 0xFFFFFFFF, which denotes -1 in two's complement. The xor with the original number changes nothing when the number is non-negative (any number xor 0 is itself). However, when eax is negative, eax xor 0xFFFFFFFF yields (not eax). The final step subtracts edx from eax. Again, if eax is non-negative, edx is zero and the final value is unchanged. For negative values, (~eax) - (-1) = (~eax) + 1 = -eax, which is the value wanted.

As you can see this approach uses only three simple arithmetic instructions and no conditional branching at all.
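For reference, the same branchless identity can be written directly in C++. This is a sketch, not the library's actual source; note that `x >> 31` on a negative int is implementation-defined before C++20, although every mainstream compiler sign-extends:

```cpp
#include <cassert>

// Branchless absolute value, mirroring the cdq / xor / sub sequence above.
// mask is 0 for non-negative x and -1 (all ones) for negative x.
int branchless_abs(int x) {
    int mask = x >> 31;        // plays the role of cdq: replicate the sign bit
    return (x ^ mask) - mask;  // x >= 0: (x ^ 0) - 0 = x;  x < 0: (~x) - (-1) = -x
}
```

Compilers recognize both this pattern and the plain conditional, so there is no need to hand-write it; it simply shows what the three instructions compute.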

Edit: After some research, it turns out that many built-in implementations of abs use the same conditional approach, return __x >= 0 ? __x : -__x;, and such a pattern is an obvious target for compiler optimization to avoid unnecessary branching.

However, that does not justify a custom abs implementation: it violates the DRY principle, and no one can guarantee that your implementation will be just as good in more sophisticated scenarios and/or on unusual platforms. Typically one should consider rewriting a library function only when there is a definite performance problem or some other defect detected in the existing implementation.
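Concretely, the OP's loop only needs the standard function. A minimal self-contained sketch of the accepted version (the member function is written here as a free function for illustration):

```cpp
#include <algorithm>  // std::max
#include <cstddef>    // std::size_t
#include <cstdlib>    // std::abs(int)
#include <vector>

// Sum of Chebyshev distances between consecutive points, using std::abs.
int coverPoints(const std::vector<int>& X, const std::vector<int>& Y) {
    int ans = 0;
    for (std::size_t i = 0; i + 1 < X.size(); ++i)
        ans += std::max(std::abs(X[i] - X[i + 1]), std::abs(Y[i] - Y[i + 1]));
    return ans;
}
```

For example, covering (0,0) → (1,1) → (1,2) takes 1 + 1 = 2 steps.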

Edit2: Just switching from int to float shows considerable degradation in the code generated for the hand-written version:

float libfoo(float x)
{
    return ::std::fabs(x);
}

andps   xmm0, xmmword ptr [rip + .LCPI0_0]

And a custom version:

inline float my_fabs(float x)
{
    return x>0.0f?x:-x;
}

float myfoo(float x)
{
    return my_fabs(x);
}

movaps  xmm1, xmmword ptr [rip + .LCPI1_0] # xmm1 = [-0.000000e+00,-0.000000e+00,-0.000000e+00,-0.000000e+00]
xorps   xmm1, xmm0
xorps   xmm2, xmm2
cmpltss xmm2, xmm0
andps   xmm0, xmm2
andnps  xmm2, xmm1
orps    xmm0, xmm2

online compiler

answered Oct 24 '22 by user7860670