Performance of C++11 modern-style loops vs old-style loops

Tags:

This is the first question I'm posting here, so I hope I won't do anything wrong.

My question concerns the performance of modern-style C++11 loops (std::for_each, range-based for) vs old-style C++ loops (for (...; ...; ...)). From what I understood, it seems to me that the motto of modern C++ is "expressivity with no compromise on performance". Modern C++ style leads to safe, clean, and fast code with little to no performance penalty and, possibly, with a performance gain over old-style C++.

Now I've made a little test to assess how big this gain is concerning loops. First I wrote the following three functions:

using namespace std;

void foo(vector<double>& v)
{
    for (size_t i = 0; i < v.size(); i++)
    {
        v[i] /= 42;
    }
}

void bar(vector<double>& v)
{
    for (auto& x : v)
    {
        x /= 42;
    }
}

void wee(vector<double>& v)
{
    for_each(begin(v), end(v), [] (double& x)
    {
        x /= 42;
    });
}

Then I compared their performance by calling them this way (properly commenting/uncommenting the three lines inside main()'s loop:

vector<double> make_vector()
{
    vector<double> v;
    for (int i = 0; i < 30000; i++) { v.push_back(i); }
    return v;
}

int main()
{
    time_t start = clock();

    auto v = make_vector();
    for (int i = 0; i <= 50000; i++) 
    { 
        // UNCOMMENT THE FUNCTION CALL TO BE TESTED, COMMENT THE OTHERS

        foo(v);
        // bar(v); 
        // wee(v);
    }

    time_t end = clock();
    cout << (end - start) << endl;

    return 0;
}

Averaging over 10 executions of each version of the program obtained by commenting/uncommenting the lines in main()'s loop, and using the old-style loop as a baseline, the range-based for loop performs ~1.9x worse, and the loop based on std::for_each and lambdas performs ~2.3x worse.

I used Clang 3.2 to compile this, and I haven't tried MS VC11 (I'm working on WinXP).

Considering my expectation of getting comparable execution times, my questions are:

Did I do something obviously wrong?
If not, couldn't a 2x performance penalty be a good reason NOT to embrace modern-style loops?

I would like to remark, that I do believe that the clarity and safety of code written in modern C++ style pay off for a possible performance loss, but I quite disagree with the statement that there is no trade-off between clarity/safety on one side and performance on the other side.

Am I missing something?

794

asked Dec 27 '12 14:12

Andy Prowl

2 Answers

It looks like the difference only shows up when you do not enable optimisations in your compiler.

With Clang you can enable optimisation with the -O[0-3] flag.

150

answered Nov 15 '22 15:11

Mankarse

Mankarse is right - most likely you have not enabled optimizations.

Actually on Clang they produce practically same result ASM code in main loop, and small difference in pre/post code.

I have tested four versions: hand_loop_index, hand_loop_iterator, range_based_for, for_each_algorithm

hand_loop_iterator, range_based_for and for_each_algorithm - all three do produce exactly same result ASM for full function body, only difference is in names of labels.

I.e. hand written for loop with iterators results in exactly same ASM code as range-based-for and std::for_each.

There are some differences between loop with index and loop with iterator versions.

Main loop in both cases is almost same. The only minor differece is that for iterators version(s) rdx register is used instead of rsi.

Index version:

.LBB0_7:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rsi), %xmm1
    movupd  -32(%rsi), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rsi)
    movupd  %xmm2, -32(%rsi)
    movupd  -16(%rsi), %xmm1
    movupd  (%rsi), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rsi)
    movupd  %xmm2, (%rsi)
    addq    $64, %rsi
    addq    $-8, %rdi
    jne .LBB0_7

Iterator version(s):

.LBB1_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rdx), %xmm1
    movupd  -32(%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rdx)
    movupd  %xmm2, -32(%rdx)
    movupd  -16(%rdx), %xmm1
    movupd  (%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rdx)
    movupd  %xmm2, (%rdx)
    addq    $64, %rdx
    addq    $-8, %rsi
    jne .LBB1_6

Pre/post code for index vs iterator versions has many differences, but it should not affect greatly total result timing for large enough arrays.

LIVE DEMO on Coliru with ASM output

#include <algorithm>
#include <iterator>
#include <vector>

using namespace std;

void hand_loop_index(vector<double> &v)
{
    for (size_t i = 0; i < v.size(); ++i)
    {
        v[i] /= 42;
    }
}

void hand_loop_iterator(vector<double> &v)
{
    for (auto first = begin(v), last = end(v); first!=last; ++first)
    {
        *first /= 42;
    }
}

void range_based_for(vector<double> &v)
{
    for (auto &x : v)
    {
        x /= 42;
    }
}

void for_each_algorithm(vector<double> &v)
{
    for_each(begin(v), end(v), [] (double &x)
    {
        x /= 42;
    });
}

Result ASM:

# clang++ -std=c++1z -O3 -Wall -pedantic -pthread main.cpp -S
    .text
    .file   "main.cpp"
    .section    .rodata.cst16,"aM",@progbits,16
    .align  16
.LCPI0_0:
    .quad   4631107791820423168     # double 4.200000e+01
    .quad   4631107791820423168     # double 4.200000e+01
    .section    .rodata.cst8,"aM",@progbits,8
    .align  8
.LCPI0_1:
    .quad   4631107791820423168     # double 42
    .text
    .globl  _Z15hand_loop_indexRSt6vectorIdSaIdEE
    .align  16, 0x90
    .type   _Z15hand_loop_indexRSt6vectorIdSaIdEE,@function
_Z15hand_loop_indexRSt6vectorIdSaIdEE:  # @_Z15hand_loop_indexRSt6vectorIdSaIdEE
    .cfi_startproc
# BB#0:
    movq    (%rdi), %rax
    movq    8(%rdi), %rcx
    subq    %rax, %rcx
    je  .LBB0_11
# BB#1:                                 # %.lr.ph
    sarq    $3, %rcx
    cmpq    $1, %rcx
    movl    $1, %edx
    cmovaq  %rcx, %rdx
    xorl    %edi, %edi
    testq   %rdx, %rdx
    je  .LBB0_10
# BB#2:                                 # %overflow.checked
    xorl    %edi, %edi
    movq    %rdx, %r8
    andq    $-4, %r8
    je  .LBB0_9
# BB#3:                                 # %vector.body.preheader
    cmpq    $1, %rcx
    movl    $1, %edi
    cmovaq  %rcx, %rdi
    addq    $-4, %rdi
    movq    %rdi, %rsi
    shrq    $2, %rsi
    xorl    %r9d, %r9d
    btq $2, %rdi
    jb  .LBB0_5
# BB#4:                                 # %vector.body.prol
    movupd  (%rax), %xmm0
    movupd  16(%rax), %xmm1
    movapd  .LCPI0_0(%rip), %xmm2   # xmm2 = [4.200000e+01,4.200000e+01]
    divpd   %xmm2, %xmm0
    divpd   %xmm2, %xmm1
    movupd  %xmm0, (%rax)
    movupd  %xmm1, 16(%rax)
    movl    $4, %r9d
.LBB0_5:                                # %vector.body.preheader.split
    testq   %rsi, %rsi
    je  .LBB0_8
# BB#6:                                 # %vector.body.preheader.split.split
    cmpq    $1, %rcx
    movl    $1, %edi
    cmovaq  %rcx, %rdi
    andq    $-4, %rdi
    subq    %r9, %rdi
    leaq    48(%rax,%r9,8), %rsi
    movapd  .LCPI0_0(%rip), %xmm0   # xmm0 = [4.200000e+01,4.200000e+01]
    .align  16, 0x90
.LBB0_7:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rsi), %xmm1
    movupd  -32(%rsi), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rsi)
    movupd  %xmm2, -32(%rsi)
    movupd  -16(%rsi), %xmm1
    movupd  (%rsi), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rsi)
    movupd  %xmm2, (%rsi)
    addq    $64, %rsi
    addq    $-8, %rdi
    jne .LBB0_7
.LBB0_8:
    movq    %r8, %rdi
.LBB0_9:                                # %middle.block
    cmpq    %rdi, %rdx
    je  .LBB0_11
    .align  16, 0x90
.LBB0_10:                               # %scalar.ph
                                        # =>This Inner Loop Header: Depth=1
    movsd   (%rax,%rdi,8), %xmm0    # xmm0 = mem[0],zero
    divsd   .LCPI0_1(%rip), %xmm0
    movsd   %xmm0, (%rax,%rdi,8)
    incq    %rdi
    cmpq    %rcx, %rdi
    jb  .LBB0_10
.LBB0_11:                               # %._crit_edge
    retq
.Lfunc_end0:
    .size   _Z15hand_loop_indexRSt6vectorIdSaIdEE, .Lfunc_end0-_Z15hand_loop_indexRSt6vectorIdSaIdEE
    .cfi_endproc

.section    .rodata.cst16,"aM",@progbits,16
    .align  16
.LCPI1_0:
    .quad   4631107791820423168     # double 4.200000e+01
    .quad   4631107791820423168     # double 4.200000e+01
    .section    .rodata.cst8,"aM",@progbits,8
    .align  8
.LCPI1_1:
    .quad   4631107791820423168     # double 42
    .text
    .globl  _Z18hand_loop_iteratorRSt6vectorIdSaIdEE
    .align  16, 0x90
    .type   _Z18hand_loop_iteratorRSt6vectorIdSaIdEE,@function
_Z18hand_loop_iteratorRSt6vectorIdSaIdEE: # @_Z18hand_loop_iteratorRSt6vectorIdSaIdEE
    .cfi_startproc
# BB#0:
    movq    (%rdi), %rdx
    movq    8(%rdi), %rax
    cmpq    %rax, %rdx
    je  .LBB1_11
# BB#1:                                 # %.lr.ph.preheader
    movabsq $4611686018427387900, %rsi # imm = 0x3FFFFFFFFFFFFFFC
    leaq    -8(%rax), %rcx
    subq    %rdx, %rcx
    shrq    $3, %rcx
    incq    %rcx
    xorl    %edi, %edi
    movq    %rcx, %r9
    andq    %rsi, %r9
    je  .LBB1_8
# BB#2:                                 # %vector.body.preheader
    andq    %rcx, %rsi
    leaq    -4(%rsi), %rdi
    movq    %rdi, %r11
    shrq    $2, %r11
    xorl    %r10d, %r10d
    btq $2, %rdi
    jb  .LBB1_4
# BB#3:                                 # %vector.body.prol
    movupd  (%rdx), %xmm0
    movupd  16(%rdx), %xmm1
    movapd  .LCPI1_0(%rip), %xmm2   # xmm2 = [4.200000e+01,4.200000e+01]
    divpd   %xmm2, %xmm0
    divpd   %xmm2, %xmm1
    movupd  %xmm0, (%rdx)
    movupd  %xmm1, 16(%rdx)
    movl    $4, %r10d
.LBB1_4:                                # %vector.body.preheader.split
    leaq    (%rdx,%r9,8), %r8
    testq   %r11, %r11
    je  .LBB1_7
# BB#5:                                 # %vector.body.preheader.split.split
    subq    %r10, %rsi
    leaq    48(%rdx,%r10,8), %rdx
    movapd  .LCPI1_0(%rip), %xmm0   # xmm0 = [4.200000e+01,4.200000e+01]
    .align  16, 0x90
.LBB1_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rdx), %xmm1
    movupd  -32(%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rdx)
    movupd  %xmm2, -32(%rdx)
    movupd  -16(%rdx), %xmm1
    movupd  (%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rdx)
    movupd  %xmm2, (%rdx)
    addq    $64, %rdx
    addq    $-8, %rsi
    jne .LBB1_6
.LBB1_7:
    movq    %r8, %rdx
    movq    %r9, %rdi
.LBB1_8:                                # %middle.block
    cmpq    %rdi, %rcx
    je  .LBB1_11
# BB#9:
    movsd   .LCPI1_1(%rip), %xmm0   # xmm0 = mem[0],zero
    .align  16, 0x90
.LBB1_10:                               # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
    movsd   (%rdx), %xmm1           # xmm1 = mem[0],zero
    divsd   %xmm0, %xmm1
    movsd   %xmm1, (%rdx)
    addq    $8, %rdx
    cmpq    %rdx, %rax
    jne .LBB1_10
.LBB1_11:                               # %._crit_edge
    retq
.Lfunc_end1:
    .size   _Z18hand_loop_iteratorRSt6vectorIdSaIdEE, .Lfunc_end1-_Z18hand_loop_iteratorRSt6vectorIdSaIdEE
    .cfi_endproc

.section    .rodata.cst16,"aM",@progbits,16
    .align  16
.LCPI2_0:
    .quad   4631107791820423168     # double 4.200000e+01
    .quad   4631107791820423168     # double 4.200000e+01
    .section    .rodata.cst8,"aM",@progbits,8
    .align  8
.LCPI2_1:
    .quad   4631107791820423168     # double 42
    .text
    .globl  _Z15range_based_forRSt6vectorIdSaIdEE
    .align  16, 0x90
    .type   _Z15range_based_forRSt6vectorIdSaIdEE,@function
_Z15range_based_forRSt6vectorIdSaIdEE:  # @_Z15range_based_forRSt6vectorIdSaIdEE
    .cfi_startproc
# BB#0:
    movq    (%rdi), %rdx
    movq    8(%rdi), %rax
    cmpq    %rax, %rdx
    je  .LBB2_11
# BB#1:                                 # %.lr.ph.preheader
    movabsq $4611686018427387900, %rsi # imm = 0x3FFFFFFFFFFFFFFC
    leaq    -8(%rax), %rcx
    subq    %rdx, %rcx
    shrq    $3, %rcx
    incq    %rcx
    xorl    %edi, %edi
    movq    %rcx, %r9
    andq    %rsi, %r9
    je  .LBB2_8
# BB#2:                                 # %vector.body.preheader
    andq    %rcx, %rsi
    leaq    -4(%rsi), %rdi
    movq    %rdi, %r11
    shrq    $2, %r11
    xorl    %r10d, %r10d
    btq $2, %rdi
    jb  .LBB2_4
# BB#3:                                 # %vector.body.prol
    movupd  (%rdx), %xmm0
    movupd  16(%rdx), %xmm1
    movapd  .LCPI2_0(%rip), %xmm2   # xmm2 = [4.200000e+01,4.200000e+01]
    divpd   %xmm2, %xmm0
    divpd   %xmm2, %xmm1
    movupd  %xmm0, (%rdx)
    movupd  %xmm1, 16(%rdx)
    movl    $4, %r10d
.LBB2_4:                                # %vector.body.preheader.split
    leaq    (%rdx,%r9,8), %r8
    testq   %r11, %r11
    je  .LBB2_7
# BB#5:                                 # %vector.body.preheader.split.split
    subq    %r10, %rsi
    leaq    48(%rdx,%r10,8), %rdx
    movapd  .LCPI2_0(%rip), %xmm0   # xmm0 = [4.200000e+01,4.200000e+01]
    .align  16, 0x90
.LBB2_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rdx), %xmm1
    movupd  -32(%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rdx)
    movupd  %xmm2, -32(%rdx)
    movupd  -16(%rdx), %xmm1
    movupd  (%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rdx)
    movupd  %xmm2, (%rdx)
    addq    $64, %rdx
    addq    $-8, %rsi
    jne .LBB2_6
.LBB2_7:
    movq    %r8, %rdx
    movq    %r9, %rdi
.LBB2_8:                                # %middle.block
    cmpq    %rdi, %rcx
    je  .LBB2_11
# BB#9:
    movsd   .LCPI2_1(%rip), %xmm0   # xmm0 = mem[0],zero
    .align  16, 0x90
.LBB2_10:                               # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
    movsd   (%rdx), %xmm1           # xmm1 = mem[0],zero
    divsd   %xmm0, %xmm1
    movsd   %xmm1, (%rdx)
    addq    $8, %rdx
    cmpq    %rdx, %rax
    jne .LBB2_10
.LBB2_11:                               # %._crit_edge
    retq
.Lfunc_end2:
    .size   _Z15range_based_forRSt6vectorIdSaIdEE, .Lfunc_end2-_Z15range_based_forRSt6vectorIdSaIdEE
    .cfi_endproc

.section    .rodata.cst16,"aM",@progbits,16
    .align  16
.LCPI3_0:
    .quad   4631107791820423168     # double 4.200000e+01
    .quad   4631107791820423168     # double 4.200000e+01
    .section    .rodata.cst8,"aM",@progbits,8
    .align  8
.LCPI3_1:
    .quad   4631107791820423168     # double 42
    .text
    .globl  _Z18for_each_algorithmRSt6vectorIdSaIdEE
    .align  16, 0x90
    .type   _Z18for_each_algorithmRSt6vectorIdSaIdEE,@function
_Z18for_each_algorithmRSt6vectorIdSaIdEE: # @_Z18for_each_algorithmRSt6vectorIdSaIdEE
    .cfi_startproc
# BB#0:
    movq    (%rdi), %rdx
    movq    8(%rdi), %rax
    cmpq    %rax, %rdx
    je  .LBB3_11
# BB#1:                                 # %.lr.ph.i.preheader
    movabsq $4611686018427387900, %rsi # imm = 0x3FFFFFFFFFFFFFFC
    leaq    -8(%rax), %rcx
    subq    %rdx, %rcx
    shrq    $3, %rcx
    incq    %rcx
    xorl    %edi, %edi
    movq    %rcx, %r9
    andq    %rsi, %r9
    je  .LBB3_8
# BB#2:                                 # %vector.body.preheader
    andq    %rcx, %rsi
    leaq    -4(%rsi), %rdi
    movq    %rdi, %r11
    shrq    $2, %r11
    xorl    %r10d, %r10d
    btq $2, %rdi
    jb  .LBB3_4
# BB#3:                                 # %vector.body.prol
    movupd  (%rdx), %xmm0
    movupd  16(%rdx), %xmm1
    movapd  .LCPI3_0(%rip), %xmm2   # xmm2 = [4.200000e+01,4.200000e+01]
    divpd   %xmm2, %xmm0
    divpd   %xmm2, %xmm1
    movupd  %xmm0, (%rdx)
    movupd  %xmm1, 16(%rdx)
    movl    $4, %r10d
.LBB3_4:                                # %vector.body.preheader.split
    leaq    (%rdx,%r9,8), %r8
    testq   %r11, %r11
    je  .LBB3_7
# BB#5:                                 # %vector.body.preheader.split.split
    subq    %r10, %rsi
    leaq    48(%rdx,%r10,8), %rdx
    movapd  .LCPI3_0(%rip), %xmm0   # xmm0 = [4.200000e+01,4.200000e+01]
    .align  16, 0x90
.LBB3_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movupd  -48(%rdx), %xmm1
    movupd  -32(%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -48(%rdx)
    movupd  %xmm2, -32(%rdx)
    movupd  -16(%rdx), %xmm1
    movupd  (%rdx), %xmm2
    divpd   %xmm0, %xmm1
    divpd   %xmm0, %xmm2
    movupd  %xmm1, -16(%rdx)
    movupd  %xmm2, (%rdx)
    addq    $64, %rdx
    addq    $-8, %rsi
    jne .LBB3_6
.LBB3_7:
    movq    %r8, %rdx
    movq    %r9, %rdi
.LBB3_8:                                # %middle.block
    cmpq    %rdi, %rcx
    je  .LBB3_11
# BB#9:
    movsd   .LCPI3_1(%rip), %xmm0   # xmm0 = mem[0],zero
    .align  16, 0x90
.LBB3_10:                               # %.lr.ph.i
                                        # =>This Inner Loop Header: Depth=1
    movsd   (%rdx), %xmm1           # xmm1 = mem[0],zero
    divsd   %xmm0, %xmm1
    movsd   %xmm1, (%rdx)
    addq    $8, %rdx
    cmpq    %rdx, %rax
    jne .LBB3_10
.LBB3_11:                               # %_ZSt8for_eachIN9__gnu_cxx17__normal_iteratorIPdSt6vectorIdSaIdEEEEZ18for_each_algorithmR5_E3$_0ET0_T_SA_S9_.exit
    retq
.Lfunc_end3:
    .size   _Z18for_each_algorithmRSt6vectorIdSaIdEE, .Lfunc_end3-_Z18for_each_algorithmRSt6vectorIdSaIdEE
    .cfi_endproc

    .ident  "clang version 3.7.0 (tags/RELEASE_370/final 246979)"
    .section    ".note.GNU-stack","",@progbits

answered Nov 15 '22 15:11

Evgeny Panasyuk

Related questions
                            
                                Does g++ compilation order matter?
                            
                                C++11 range-based for on a vector of pointers
                            
                                constexpr question, why do these two different programs run in such a different amount of time with g++?
                            
                                What does #pragma hdrstop without parameter do when used in multiple files?
                            
                                What's the opposite of `fixed` in cout?
                            
                                std::bind and std::function questions
                            
                                Passing `this` before base constructors are done: UB or just dangerous?
                            
                                Is there a way to add a QWidget to a QMenu in QtCreator
                            
                                STL function to test whether a value is within some range?
                            
                                Safe to return a vector populated with local variables?
                            
                                Ambiguous symbol error?
                            
                                assignment operator within function parameter C++
                            
                                c++11 std::async doesn't work in mingw
                            
                                How to define nullptr for supporting both C++03 and C++11? [duplicate]
                            
                                convert String to type const char* using Arduino
                            
                                Memory consumption by STL containers
                            
                                Weird compilation output
                            
                                Partial template specialization for arrays
                            
                                Why does my destructor appear to be called more often than the constructor?
                            
                                operator modulo change in c++ 11? [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Performance of C++11 modern-style loops vs old-style loops

Tags:

c++

performance

foreach

c++11

auto