Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance of C++11 modern-style loops vs old-style loops

This is the first question I'm posting here, so I hope I won't do anything wrong.

My question concerns the performance of modern-style C++11 loops (std::for_each, range-based for) vs old-style C++ loops (for (...; ...; ...)). From what I understood, it seems to me that the motto of modern C++ is "expressivity with no compromise on performance". Modern C++ style leads to safe, clean, and fast code with little to no performance penalty and, possibly, with a performance gain over old-style C++.

Now I've made a little test to assess how big this gain is concerning loops. First I wrote the following three functions:

using namespace std;

void foo(vector<double>& v)
{
    for (size_t i = 0; i < v.size(); i++)
    {
        v[i] /= 42;
    }
}

void bar(vector<double>& v)
{
    for (auto& x : v)
    {
        x /= 42;
    }
}

void wee(vector<double>& v)
{
    for_each(begin(v), end(v), [] (double& x)
    {
        x /= 42;
    });
}

Then I compared their performance by calling them this way (properly commenting/uncommenting the three lines inside main()'s loop:

vector<double> make_vector()
{
    vector<double> v;
    for (int i = 0; i < 30000; i++) { v.push_back(i); }
    return v;
}

int main()
{
    time_t start = clock();

    auto v = make_vector();
    for (int i = 0; i <= 50000; i++) 
    { 
        // UNCOMMENT THE FUNCTION CALL TO BE TESTED, COMMENT THE OTHERS

        foo(v);
        // bar(v); 
        // wee(v);
    }

    time_t end = clock();
    cout << (end - start) << endl;

    return 0;
}

Averaging over 10 executions of each version of the program obtained by commenting/uncommenting the lines in main()'s loop, and using the old-style loop as a baseline, the range-based for loop performs ~1.9x worse, and the loop based on std::for_each and lambdas performs ~2.3x worse.

I used Clang 3.2 to compile this, and I haven't tried MS VC11 (I'm working on WinXP).

Considering my expectation of getting comparable execution times, my questions are:

  1. Did I do something obviously wrong?
  2. If not, couldn't a 2x performance penalty be a good reason NOT to embrace modern-style loops?

I would like to remark, that I do believe that the clarity and safety of code written in modern C++ style pay off for a possible performance loss, but I quite disagree with the statement that there is no trade-off between clarity/safety on one side and performance on the other side.

Am I missing something?

like image 794
Andy Prowl Avatar asked Dec 27 '12 14:12

Andy Prowl


People also ask

Are range-based for loops faster?

Range-for is as fast as possible since it caches the end iterator[citationprovided], uses pre-increment and only dereferences the iterator once.

Why use range-based for loop?

Range-based for loop (since C++11) Executes a for loop over a range. Used as a more readable equivalent to the traditional for loop operating over a range of values, such as all elements in a container.

When can you use range-based for loop C++?

Use the range-based for statement to construct loops that must execute through a range, which is defined as anything that you can iterate through—for example, std::vector , or any other C++ Standard Library sequence whose range is defined by a begin() and end() .

What does colon mean in for loop C++?

Colon ':' is basically a operator which indicates that the y is inside x and assign the y's value to the elements of array one by one.


2 Answers

It looks like the difference only shows up when you do not enable optimisations in your compiler.

With Clang you can enable optimisation with the -O[0-3] flag.

like image 150
Mankarse Avatar answered Nov 15 '22 15:11

Mankarse


Mankarse is right - most likely you have not enabled optimizations.

Actually on Clang they produce practically same result ASM code in main loop, and small difference in pre/post code.

I have tested four versions: hand_loop_index, hand_loop_iterator, range_based_for, for_each_algorithm

hand_loop_iterator, range_based_for and for_each_algorithm - all three do produce exactly same result ASM for full function body, only difference is in names of labels.

I.e. hand written for loop with iterators results in exactly same ASM code as range-based-for and std::for_each.

There are some differences between loop with index and loop with iterator versions.

Main loop in both cases is almost same. The only minor differece is that for iterators version(s) rdx register is used instead of rsi.

Index version:

.LBB0_7:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rsi), %xmm1
    movupd  -32(%rsi), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rsi)
    movupd  %xmm2, -32(%rsi)
    movupd  -16(%rsi), %xmm1
    movupd  (%rsi), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rsi)
    movupd  %xmm2, (%rsi)
    addq    $64, %rsi
    addq    $-8, %rdi
    jne .LBB0_7

Iterator version(s):

.LBB1_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rdx), %xmm1
    movupd  -32(%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rdx)
    movupd  %xmm2, -32(%rdx)
    movupd  -16(%rdx), %xmm1
    movupd  (%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rdx)
    movupd  %xmm2, (%rdx)
    addq    $64, %rdx
    addq    $-8, %rsi
    jne .LBB1_6

Pre/post code for index vs iterator versions has many differences, but it should not affect greatly total result timing for large enough arrays.

LIVE DEMO on Coliru with ASM output

#include <algorithm>
#include <iterator>
#include <vector>

using namespace std;

void hand_loop_index(vector<double> &v)
{
    for (size_t i = 0; i < v.size(); ++i)
    {
        v[i] /= 42;
    }
}

void hand_loop_iterator(vector<double> &v)
{
    for (auto first = begin(v), last = end(v); first!=last; ++first)
    {
        *first /= 42;
    }
}

void range_based_for(vector<double> &v)
{
    for (auto &x : v)
    {
        x /= 42;
    }
}

void for_each_algorithm(vector<double> &v)
{
    for_each(begin(v), end(v), [] (double &x)
    {
        x /= 42;
    });
}

Result ASM:

# clang++ -std=c++1z -O3 -Wall -pedantic -pthread main.cpp -S
    .text
    .file   "main.cpp"
    .section    .rodata.cst16,"aM",@progbits,16
    .align  16
.LCPI0_0:
    .quad   4631107791820423168     # double 4.200000e+01
    .quad   4631107791820423168     # double 4.200000e+01
    .section    .rodata.cst8,"aM",@progbits,8
    .align  8
.LCPI0_1:
    .quad   4631107791820423168     # double 42
    .text
    .globl  _Z15hand_loop_indexRSt6vectorIdSaIdEE
    .align  16, 0x90
    .type   _Z15hand_loop_indexRSt6vectorIdSaIdEE,@function
_Z15hand_loop_indexRSt6vectorIdSaIdEE:  # @_Z15hand_loop_indexRSt6vectorIdSaIdEE
    .cfi_startproc
# BB#0:
    movq    (%rdi), %rax
    movq    8(%rdi), %rcx
    subq    %rax, %rcx
    je  .LBB0_11
# BB#1:                                 # %.lr.ph
    sarq    $3, %rcx
    cmpq    $1, %rcx
    movl    $1, %edx
    cmovaq  %rcx, %rdx
    xorl    %edi, %edi
    testq   %rdx, %rdx
    je  .LBB0_10
# BB#2:                                 # %overflow.checked
    xorl    %edi, %edi
    movq    %rdx, %r8
    andq    $-4, %r8
    je  .LBB0_9
# BB#3:                                 # %vector.body.preheader
    cmpq    $1, %rcx
    movl    $1, %edi
    cmovaq  %rcx, %rdi
    addq    $-4, %rdi
    movq    %rdi, %rsi
    shrq    $2, %rsi
    xorl    %r9d, %r9d
    btq $2, %rdi
    jb  .LBB0_5
# BB#4:                                 # %vector.body.prol
    movupd  (%rax), %xmm0
    movupd  16(%rax), %xmm1
    movapd  .LCPI0_0(%rip), %xmm2   # xmm2 = [4.200000e+01,4.200000e+01]
    divpd   %xmm2, %xmm0
    divpd   %xmm2, %xmm1
    movupd  %xmm0, (%rax)
    movupd  %xmm1, 16(%rax)
    movl    $4, %r9d
.LBB0_5:                                # %vector.body.preheader.split
    testq   %rsi, %rsi
    je  .LBB0_8
# BB#6:                                 # %vector.body.preheader.split.split
    cmpq    $1, %rcx
    movl    $1, %edi
    cmovaq  %rcx, %rdi
    andq    $-4, %rdi
    subq    %r9, %rdi
    leaq    48(%rax,%r9,8), %rsi
    movapd  .LCPI0_0(%rip), %xmm0   # xmm0 = [4.200000e+01,4.200000e+01]
    .align  16, 0x90
.LBB0_7:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rsi), %xmm1
    movupd  -32(%rsi), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rsi)
    movupd  %xmm2, -32(%rsi)
    movupd  -16(%rsi), %xmm1
    movupd  (%rsi), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rsi)
    movupd  %xmm2, (%rsi)
    addq    $64, %rsi
    addq    $-8, %rdi
    jne .LBB0_7
.LBB0_8:
    movq    %r8, %rdi
.LBB0_9:                                # %middle.block
    cmpq    %rdi, %rdx
    je  .LBB0_11
    .align  16, 0x90
.LBB0_10:                               # %scalar.ph
                                        # =>This Inner Loop Header: Depth=1
    movsd   (%rax,%rdi,8), %xmm0    # xmm0 = mem[0],zero
    divsd   .LCPI0_1(%rip), %xmm0
    movsd   %xmm0, (%rax,%rdi,8)
    incq    %rdi
    cmpq    %rcx, %rdi
    jb  .LBB0_10
.LBB0_11:                               # %._crit_edge
    retq
.Lfunc_end0:
    .size   _Z15hand_loop_indexRSt6vectorIdSaIdEE, .Lfunc_end0-_Z15hand_loop_indexRSt6vectorIdSaIdEE
    .cfi_endproc

.section    .rodata.cst16,"aM",@progbits,16
    .align  16
.LCPI1_0:
    .quad   4631107791820423168     # double 4.200000e+01
    .quad   4631107791820423168     # double 4.200000e+01
    .section    .rodata.cst8,"aM",@progbits,8
    .align  8
.LCPI1_1:
    .quad   4631107791820423168     # double 42
    .text
    .globl  _Z18hand_loop_iteratorRSt6vectorIdSaIdEE
    .align  16, 0x90
    .type   _Z18hand_loop_iteratorRSt6vectorIdSaIdEE,@function
_Z18hand_loop_iteratorRSt6vectorIdSaIdEE: # @_Z18hand_loop_iteratorRSt6vectorIdSaIdEE
    .cfi_startproc
# BB#0:
    movq    (%rdi), %rdx
    movq    8(%rdi), %rax
    cmpq    %rax, %rdx
    je  .LBB1_11
# BB#1:                                 # %.lr.ph.preheader
    movabsq $4611686018427387900, %rsi # imm = 0x3FFFFFFFFFFFFFFC
    leaq    -8(%rax), %rcx
    subq    %rdx, %rcx
    shrq    $3, %rcx
    incq    %rcx
    xorl    %edi, %edi
    movq    %rcx, %r9
    andq    %rsi, %r9
    je  .LBB1_8
# BB#2:                                 # %vector.body.preheader
    andq    %rcx, %rsi
    leaq    -4(%rsi), %rdi
    movq    %rdi, %r11
    shrq    $2, %r11
    xorl    %r10d, %r10d
    btq $2, %rdi
    jb  .LBB1_4
# BB#3:                                 # %vector.body.prol
    movupd  (%rdx), %xmm0
    movupd  16(%rdx), %xmm1
    movapd  .LCPI1_0(%rip), %xmm2   # xmm2 = [4.200000e+01,4.200000e+01]
    divpd   %xmm2, %xmm0
    divpd   %xmm2, %xmm1
    movupd  %xmm0, (%rdx)
    movupd  %xmm1, 16(%rdx)
    movl    $4, %r10d
.LBB1_4:                                # %vector.body.preheader.split
    leaq    (%rdx,%r9,8), %r8
    testq   %r11, %r11
    je  .LBB1_7
# BB#5:                                 # %vector.body.preheader.split.split
    subq    %r10, %rsi
    leaq    48(%rdx,%r10,8), %rdx
    movapd  .LCPI1_0(%rip), %xmm0   # xmm0 = [4.200000e+01,4.200000e+01]
    .align  16, 0x90
.LBB1_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rdx), %xmm1
    movupd  -32(%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rdx)
    movupd  %xmm2, -32(%rdx)
    movupd  -16(%rdx), %xmm1
    movupd  (%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rdx)
    movupd  %xmm2, (%rdx)
    addq    $64, %rdx
    addq    $-8, %rsi
    jne .LBB1_6
.LBB1_7:
    movq    %r8, %rdx
    movq    %r9, %rdi
.LBB1_8:                                # %middle.block
    cmpq    %rdi, %rcx
    je  .LBB1_11
# BB#9:
    movsd   .LCPI1_1(%rip), %xmm0   # xmm0 = mem[0],zero
    .align  16, 0x90
.LBB1_10:                               # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
    movsd   (%rdx), %xmm1           # xmm1 = mem[0],zero
    divsd   %xmm0, %xmm1
    movsd   %xmm1, (%rdx)
    addq    $8, %rdx
    cmpq    %rdx, %rax
    jne .LBB1_10
.LBB1_11:                               # %._crit_edge
    retq
.Lfunc_end1:
    .size   _Z18hand_loop_iteratorRSt6vectorIdSaIdEE, .Lfunc_end1-_Z18hand_loop_iteratorRSt6vectorIdSaIdEE
    .cfi_endproc

.section    .rodata.cst16,"aM",@progbits,16
    .align  16
.LCPI2_0:
    .quad   4631107791820423168     # double 4.200000e+01
    .quad   4631107791820423168     # double 4.200000e+01
    .section    .rodata.cst8,"aM",@progbits,8
    .align  8
.LCPI2_1:
    .quad   4631107791820423168     # double 42
    .text
    .globl  _Z15range_based_forRSt6vectorIdSaIdEE
    .align  16, 0x90
    .type   _Z15range_based_forRSt6vectorIdSaIdEE,@function
_Z15range_based_forRSt6vectorIdSaIdEE:  # @_Z15range_based_forRSt6vectorIdSaIdEE
    .cfi_startproc
# BB#0:
    movq    (%rdi), %rdx
    movq    8(%rdi), %rax
    cmpq    %rax, %rdx
    je  .LBB2_11
# BB#1:                                 # %.lr.ph.preheader
    movabsq $4611686018427387900, %rsi # imm = 0x3FFFFFFFFFFFFFFC
    leaq    -8(%rax), %rcx
    subq    %rdx, %rcx
    shrq    $3, %rcx
    incq    %rcx
    xorl    %edi, %edi
    movq    %rcx, %r9
    andq    %rsi, %r9
    je  .LBB2_8
# BB#2:                                 # %vector.body.preheader
    andq    %rcx, %rsi
    leaq    -4(%rsi), %rdi
    movq    %rdi, %r11
    shrq    $2, %r11
    xorl    %r10d, %r10d
    btq $2, %rdi
    jb  .LBB2_4
# BB#3:                                 # %vector.body.prol
    movupd  (%rdx), %xmm0
    movupd  16(%rdx), %xmm1
    movapd  .LCPI2_0(%rip), %xmm2   # xmm2 = [4.200000e+01,4.200000e+01]
    divpd   %xmm2, %xmm0
    divpd   %xmm2, %xmm1
    movupd  %xmm0, (%rdx)
    movupd  %xmm1, 16(%rdx)
    movl    $4, %r10d
.LBB2_4:                                # %vector.body.preheader.split
    leaq    (%rdx,%r9,8), %r8
    testq   %r11, %r11
    je  .LBB2_7
# BB#5:                                 # %vector.body.preheader.split.split
    subq    %r10, %rsi
    leaq    48(%rdx,%r10,8), %rdx
    movapd  .LCPI2_0(%rip), %xmm0   # xmm0 = [4.200000e+01,4.200000e+01]
    .align  16, 0x90
.LBB2_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rdx), %xmm1
    movupd  -32(%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rdx)
    movupd  %xmm2, -32(%rdx)
    movupd  -16(%rdx), %xmm1
    movupd  (%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rdx)
    movupd  %xmm2, (%rdx)
    addq    $64, %rdx
    addq    $-8, %rsi
    jne .LBB2_6
.LBB2_7:
    movq    %r8, %rdx
    movq    %r9, %rdi
.LBB2_8:                                # %middle.block
    cmpq    %rdi, %rcx
    je  .LBB2_11
# BB#9:
    movsd   .LCPI2_1(%rip), %xmm0   # xmm0 = mem[0],zero
    .align  16, 0x90
.LBB2_10:                               # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
    movsd   (%rdx), %xmm1           # xmm1 = mem[0],zero
    divsd   %xmm0, %xmm1
    movsd   %xmm1, (%rdx)
    addq    $8, %rdx
    cmpq    %rdx, %rax
    jne .LBB2_10
.LBB2_11:                               # %._crit_edge
    retq
.Lfunc_end2:
    .size   _Z15range_based_forRSt6vectorIdSaIdEE, .Lfunc_end2-_Z15range_based_forRSt6vectorIdSaIdEE
    .cfi_endproc

.section    .rodata.cst16,"aM",@progbits,16
    .align  16
.LCPI3_0:
    .quad   4631107791820423168     # double 4.200000e+01
    .quad   4631107791820423168     # double 4.200000e+01
    .section    .rodata.cst8,"aM",@progbits,8
    .align  8
.LCPI3_1:
    .quad   4631107791820423168     # double 42
    .text
    .globl  _Z18for_each_algorithmRSt6vectorIdSaIdEE
    .align  16, 0x90
    .type   _Z18for_each_algorithmRSt6vectorIdSaIdEE,@function
_Z18for_each_algorithmRSt6vectorIdSaIdEE: # @_Z18for_each_algorithmRSt6vectorIdSaIdEE
    .cfi_startproc
# BB#0:
    movq    (%rdi), %rdx
    movq    8(%rdi), %rax
    cmpq    %rax, %rdx
    je  .LBB3_11
# BB#1:                                 # %.lr.ph.i.preheader
    movabsq $4611686018427387900, %rsi # imm = 0x3FFFFFFFFFFFFFFC
    leaq    -8(%rax), %rcx
    subq    %rdx, %rcx
    shrq    $3, %rcx
    incq    %rcx
    xorl    %edi, %edi
    movq    %rcx, %r9
    andq    %rsi, %r9
    je  .LBB3_8
# BB#2:                                 # %vector.body.preheader
    andq    %rcx, %rsi
    leaq    -4(%rsi), %rdi
    movq    %rdi, %r11
    shrq    $2, %r11
    xorl    %r10d, %r10d
    btq $2, %rdi
    jb  .LBB3_4
# BB#3:                                 # %vector.body.prol
    movupd  (%rdx), %xmm0
    movupd  16(%rdx), %xmm1
    movapd  .LCPI3_0(%rip), %xmm2   # xmm2 = [4.200000e+01,4.200000e+01]
    divpd   %xmm2, %xmm0
    divpd   %xmm2, %xmm1
    movupd  %xmm0, (%rdx)
    movupd  %xmm1, 16(%rdx)
    movl    $4, %r10d
.LBB3_4:                                # %vector.body.preheader.split
    leaq    (%rdx,%r9,8), %r8
    testq   %r11, %r11
    je  .LBB3_7
# BB#5:                                 # %vector.body.preheader.split.split
    subq    %r10, %rsi
    leaq    48(%rdx,%r10,8), %rdx
    movapd  .LCPI3_0(%rip), %xmm0   # xmm0 = [4.200000e+01,4.200000e+01]
    .align  16, 0x90
.LBB3_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rdx), %xmm1
    movupd  -32(%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rdx)
    movupd  %xmm2, -32(%rdx)
    movupd  -16(%rdx), %xmm1
    movupd  (%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rdx)
    movupd  %xmm2, (%rdx)
    addq    $64, %rdx
    addq    $-8, %rsi
    jne .LBB3_6
.LBB3_7:
    movq    %r8, %rdx
    movq    %r9, %rdi
.LBB3_8:                                # %middle.block
    cmpq    %rdi, %rcx
    je  .LBB3_11
# BB#9:
    movsd   .LCPI3_1(%rip), %xmm0   # xmm0 = mem[0],zero
    .align  16, 0x90
.LBB3_10:                               # %.lr.ph.i
                                        # =>This Inner Loop Header: Depth=1
    movsd   (%rdx), %xmm1           # xmm1 = mem[0],zero
    divsd   %xmm0, %xmm1
    movsd   %xmm1, (%rdx)
    addq    $8, %rdx
    cmpq    %rdx, %rax
    jne .LBB3_10
.LBB3_11:                               # %_ZSt8for_eachIN9__gnu_cxx17__normal_iteratorIPdSt6vectorIdSaIdEEEEZ18for_each_algorithmR5_E3$_0ET0_T_SA_S9_.exit
    retq
.Lfunc_end3:
    .size   _Z18for_each_algorithmRSt6vectorIdSaIdEE, .Lfunc_end3-_Z18for_each_algorithmRSt6vectorIdSaIdEE
    .cfi_endproc

    .ident  "clang version 3.7.0 (tags/RELEASE_370/final 246979)"
    .section    ".note.GNU-stack","",@progbits
like image 25
Evgeny Panasyuk Avatar answered Nov 15 '22 15:11

Evgeny Panasyuk