Why is this C++ program so incredibly fast?

Tags:

I wrote a little benchmark to compare the performance of different interpreters/compilers for Python, Ruby, JavaScript and C++. As expected, it turns out that (optimized) C++ beats the scripting languages, but the factor by which it does so is incredibly high.

The results are:

sven@jet:~/tmp/js$ time node bla.js              # * JavaScript with node * 0  real    0m1.222s user    0m1.190s sys 0m0.015s sven@jet:~/tmp/js$ time ruby foo.rb              # * Ruby * 0  real    0m52.428s user    0m52.395s sys 0m0.028s sven@jet:~/tmp/js$ time python blub.py           # * Python with CPython * 0  real    1m16.480s user    1m16.371s sys 0m0.080s  sven@jet:~/tmp/js$ time pypy blub.py             # * Python with PyPy * 0  real    0m4.707s user    0m4.579s sys 0m0.028s  sven@jet:~/tmp/js$ time ./cpp_non_optimized 1000 1000000 # * C++ with -O0 (gcc) * 0  real    0m1.702s user    0m1.699s sys 0m0.002s sven@jet:~/tmp/js$ time ./cpp_optimized 1000 1000000     # * C++ with -O3 (gcc) * 0  real    0m0.003s # (!!!) <---------------------------------- WHY? user    0m0.002s sys 0m0.002s

I am wondering if anybody can provide an explanation why the optimized C++ code is over three orders of magnitude faster than everything else.

The C++ benchmark uses command-line parameters in order to prevent pre-computing the result at compile time.

Below, I placed the source codes for the different language benchmarks, which should be semantically equivalent. Also, I provided the assembly code for the optimized C++ compiler output (using gcc). When looking at the optimized assembly, it seems that the compiler merged the two loops in the benchmark to a single one, but nevertheless, there IS still a loop!

JavaScript:

var s = 0; var outer = 1000; var inner = 1000000;  for (var i = 0; i < outer; ++i) {     for (var j = 0; j < inner; ++j) {         ++s;     }     s -= inner; } console.log(s);

Python:

s = 0 outer = 1000 inner = 1000000  for _ in xrange(outer):     for _ in xrange(inner):         s += 1     s -= inner print s

Ruby:

s = 0 outer = 1000 inner = 1000000  outer_end = outer - 1 inner_end = inner - 1  for i in 0..outer_end   for j in 0..inner_end     s = s + 1   end   s = s - inner end puts s

C++:

#include <iostream> #include <cstdlib> #include <cstdint>  int main(int argc, char* argv[]) {   uint32_t s = 0;   uint32_t outer = atoi(argv[1]);   uint32_t inner = atoi(argv[2]);   for (uint32_t i = 0; i < outer; ++i) {     for (uint32_t j = 0; j < inner; ++j)       ++s;     s -= inner;   }   std::cout << s << std::endl;   return 0; }

Assembly (when compiling the above C++ code with gcc -S -O3 -std=c++0x):

    .file   "bar.cpp"     .section    .text.startup,"ax",@progbits     .p2align 4,,15     .globl  main     .type   main, @function main: .LFB1266:     .cfi_startproc     pushq   %r12     .cfi_def_cfa_offset 16     .cfi_offset 12, -16     movl    $10, %edx     movq    %rsi, %r12     pushq   %rbp     .cfi_def_cfa_offset 24     .cfi_offset 6, -24     pushq   %rbx     .cfi_def_cfa_offset 32     .cfi_offset 3, -32     movq    8(%rsi), %rdi     xorl    %esi, %esi     call    strtol     movq    16(%r12), %rdi     movq    %rax, %rbp     xorl    %esi, %esi     movl    $10, %edx     call    strtol     testl   %ebp, %ebp     je  .L6     movl    %ebp, %ebx     xorl    %eax, %eax     xorl    %edx, %edx     .p2align 4,,10     .p2align 3 .L3:                             # <--- Here is the loop     addl    $1, %eax             # <---     cmpl    %eax, %ebx           # <---     ja  .L3                      # <--- .L2:     movl    %edx, %esi     movl    $_ZSt4cout, %edi     call    _ZNSo9_M_insertImEERSoT_     movq    %rax, %rdi     call    _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_     popq    %rbx     .cfi_remember_state     .cfi_def_cfa_offset 24     popq    %rbp     .cfi_def_cfa_offset 16     xorl    %eax, %eax     popq    %r12     .cfi_def_cfa_offset 8     ret .L6:     .cfi_restore_state     xorl    %edx, %edx     jmp .L2     .cfi_endproc .LFE1266:     .size   main, .-main     .p2align 4,,15     .type   _GLOBAL__sub_I_main, @function _GLOBAL__sub_I_main: .LFB1420:     .cfi_startproc     subq    $8, %rsp     .cfi_def_cfa_offset 16     movl    $_ZStL8__ioinit, %edi     call    _ZNSt8ios_base4InitC1Ev     movl    $__dso_handle, %edx     movl    $_ZStL8__ioinit, %esi     movl    $_ZNSt8ios_base4InitD1Ev, %edi     addq    $8, %rsp     .cfi_def_cfa_offset 8     jmp __cxa_atexit     .cfi_endproc .LFE1420:     .size   _GLOBAL__sub_I_main, .-_GLOBAL__sub_I_main     .section    .init_array,"aw"     .align 8     .quad   _GLOBAL__sub_I_main     .local  _ZStL8__ioinit     .comm   _ZStL8__ioinit,1,1     .hidden __dso_handle     .ident  "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"     .section    .note.GNU-stack,"",@progbits

233

asked Jul 18 '14 15:07

Sven Hager

1 Answers

The optimizer has worked out that the inner loop along with the subsequent line is a no-op, and eliminated it. Unfortunately it hasn't managed to eliminate the outer loop as well.

Note that the node.js example is faster than the unoptimized C++ example, indicating that V8 (node's JIT compiler) has managed to eliminate at least one of the loops. However, its optimization has some overhead, as (like any JIT compiler) it must balance the opportunities for optimization and profile-guided re-optimization against the cost of doing so.

116

answered Sep 24 '22 07:09

ecatmur

Related questions
                            
                                How does dereferencing of a function pointer happen?
                            
                                Is there any 'out-of-the-box' 2D/3D plotting library for C++? [closed]
                            
                                Type of `this` in static member function?
                            
                                C++ Return value, reference, const reference
                            
                                Why must a short be converted to an int before arithmetic operations in C and C++?
                            
                                Why is a const variable sometimes not required to be captured in a lambda?
                            
                                How to create a vector of user defined size but with no predefined values? [closed]
                            
                                How to convert typename T to string in c++ [duplicate]
                            
                                How bad is "if (!this)" in a C++ member function?
                            
                                Does 'auto' type assignments of a pointer in c++11 require '*'?
                            
                                Derived template-class access to base-class member-data
                            
                                inline vs __inline vs __inline__ vs __forceinline?
                            
                                What is the difference between WM_QUIT, WM_CLOSE, and WM_DESTROY in a windows program?
                            
                                static variable link error [duplicate]
                            
                                Copying std::vector: prefer assignment or std::copy?
                            
                                Object destruction in C++
                            
                                Prepend std::string
                            
                                Is a whole number float divided by itself guaranteed to be 1.f?
                            
                                Circular lock-free buffer
                            
                                #include all .cpp files into a single compilation unit?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why is this C++ program so incredibly fast?

Tags:

c++

performance

assembly

compiler-construction

Sven Hager

People also ask

1 Answers

ecatmur

Recent Activity

Donate For Us