Why is std::equal much slower than a hand-rolled loop for two small std::array?

I was profiling a small piece of code that is part of a larger simulation and, to my surprise, std::equal is much slower than a simple for loop that compares the two arrays element by element. I wrote a small test case, which I believe is a fair comparison between the two, and the difference with g++ 6.1.1 from the Debian archives is not insignificant. I am comparing two four-element arrays of signed integers. I tested std::equal, operator==, and a small for loop. I didn't use std::chrono for exact timing, but the difference is plain to see with time ./a.out.

My question is: given the sample code below, why do operator== and std::equal (which I believe calls operator==) take approximately 40 s to complete, while the hand-written loop takes only 8 s? I'm using a very recent Intel-based laptop. The for loop is faster at all optimization levels: -O1, -O2, -O3, and -Ofast. I compiled the code with g++ -std=c++14 -Ofast -march=native -mtune=native.

The loop runs a huge number of times, just to make the difference clear to the naked eye. The modulo operations are a cheap way to modify one element of each array on every iteration, and serve to keep the compiler from optimizing the loop away.

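#include <algorithm>
#include <array>
#include <cstddef>   // size_t
#include <cstdint>   // int32_t, uint32_t
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>

using namespace std;
using T = array<int32_t, 4>;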

bool 
are_equal_manual(const T& L, const T& R)
noexcept {
    bool test{ true };
    for(uint32_t i{0}; i < 4; ++i) { test = test && (L[i] == R[i]); }
    return test;
}

bool
are_equal_alg(const T& L, const T& R)
noexcept {
    bool test{ equal(cbegin(L),cend(L),cbegin(R)) };
    return test;
}

int main(int argc, char** argv) {

    T left{ {0,1,2,3} };
    T right{ {0,1,2,3} };

    cout << boolalpha << are_equal_manual(left,right) << endl;
    cout << boolalpha << are_equal_alg(left,right) << endl;
    cout << boolalpha << (left == right) << endl;

    bool t{};
    const size_t N{ 5000000000 };
    for(size_t i{}; i < N; ++i) {
        //t = left == right; // SLOW
        //t = are_equal_manual(left,right); // FAST
        t = are_equal_alg(left,right);  // SLOW
        left[0] = i % 10;
        right[2] = i % 8;
    }

    cout<< boolalpha << t << endl;

    return(EXIT_SUCCESS);
}

asked Sep 01 '16 03:09

KBentley57

1 Answer

Here's the generated assembly of the for loop in main() when the are_equal_manual(left,right) function is used:

.L21:
        xor     esi, esi
        test    eax, eax
        jne     .L20
        cmp     edx, 2
        sete    sil
.L20:
        mov     rax, rcx
        movzx   esi, sil
        mul     r8
        shr     rdx, 3
        lea     rax, [rdx+rdx*4]
        mov     edx, ecx
        add     rax, rax
        sub     edx, eax
        mov     eax, edx
        mov     edx, ecx
        add     rcx, 1
        and     edx, 7
        cmp     rcx, rdi

And here's what's generated when the are_equal_alg(left,right) function is used:

.L20:
        lea     rsi, [rsp+16]
        mov     edx, 16
        mov     rdi, rsp
        call    memcmp
        mov     ecx, eax
        mov     rax, rbx
        mov     rdi, rbx
        mul     r12
        shr     rdx, 3
        lea     rax, [rdx+rdx*4]
        add     rax, rax
        sub     rdi, rax
        mov     eax, ebx
        add     rbx, 1
        and     eax, 7
        cmp     rbx, rbp
        mov     DWORD PTR [rsp], edi
        mov     DWORD PTR [rsp+24], eax
        jne     .L20

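I'm not exactly sure what's happening in the generated code for the first case, but it's clearly not calling memcmp(). It doesn't appear to read the arrays from memory at all: the only comparisons left are test eax, eax and cmp edx, 2, which look like the two elements the loop modifies (left[0] and right[2]) being checked in registers against the constant values of their counterparts. While the loop still iterates 5000000000 times, each iteration does very little work. The loop that uses are_equal_alg(left,right), on the other hand, stores both modified elements back to the stack and calls memcmp() on all 16 bytes in every iteration. Basically, the compiler is able to optimize the manual comparison much better than the std::equal template.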

answered Sep 18 '22 19:09

Michael Burr

Tags: c++, performance, stl, c++14, gcc6
