Visual C++ optimization options - how to improve the code output?

Question

Are there any options (other than /O2) to improve the Visual C++ code output? The MSDN documentation is quite bad in this regard. Note that I'm not asking about project-wide settings (link-time optimization, etc...). I'm only interested in this particular example.

The fairly simple C++11 code looks like this:

#include <vector>
int main() {
    std::vector<int> v = {1, 2, 3, 4};
    int sum = 0;
    for(int i = 0; i < v.size(); i++) {
        sum += v[i];
    }
    return sum;
}

Clang's output with libc++ is quite compact:

main: # @main
  mov eax, 10
  ret

Visual C++ output, on the other hand, is a multi-page mess. Am I missing something here or is VS really this bad?

Compiler explorer link: https://godbolt.org/g/GJYHjE

valiano · Accepted Answer

Unfortunately, it's difficult to greatly improve the Visual C++ output in this case, even with more aggressive optimization flags. Several factors contribute to VS's inefficiency here, including the lack of certain compiler optimizations and the structure of Microsoft's implementation of <vector>.

Inspecting the generated assembly, Clang does an outstanding job optimizing this code. Specifically, when compared to VS, Clang performs very effective constant propagation, function inlining (and consequently, dead code elimination), and new/delete optimization.

Constant Propagation

In the example, the vector is statically initialized:

std::vector<int> v = {1, 2, 3, 4};

Normally, the compiler stores the constants 1, 2, 3, 4 in data memory, and the for loop loads one value at a time, starting from the low address where 1 is stored, adding each value to the sum.

Here's the abbreviated VS code for doing this:

movdqa   xmm0, XMMWORD PTR __xmm@00000004000000030000000200000001
...
movdqu   XMMWORD PTR $T1[rsp], xmm0 ; Store integers 1, 2, 3, 4 in memory
...
$LL4@main:
    add      ebx, DWORD PTR [rdx]   ; loop and sum the values
    lea      rdx, QWORD PTR [rdx+4]
    inc      r8d
    movsxd   rax, r8d
    cmp      rax, r9
    jb       SHORT $LL4@main

Clang, however, is clever enough to realize that the sum can be calculated in advance. My best guess is that it replaces the loads of the constants from memory with constant mov operations into registers (propagating the constants), and then combines them into the result of 10. This has the useful side effect of breaking dependencies: since the addresses are no longer loaded from, the compiler is free to remove everything else as dead code.

Clang seems to be unique in doing this: neither VS nor GCC was able to precalculate the result of the vector accumulation.
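As a rough illustration of the end result (a minimal sketch, not a description of Clang's internals), C++14 constexpr lets us request the same folding explicitly. The names kValues, sum_fixed, kFoldedSum and runtime_sum are made up for this example and do not appear in the question:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; needs C++14 for the loop inside constexpr.
constexpr std::array<int, 4> kValues = {{1, 2, 3, 4}};

constexpr int sum_fixed(const std::array<int, 4>& values) {
    int sum = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        sum += values[i];
    }
    return sum;
}

// Forced compile-time evaluation: the compiler has to fold this to a constant.
constexpr int kFoldedSum = sum_fixed(kValues);
static_assert(kFoldedSum == 10, "sum is known at compile time");

// The same loop over std::vector. Whether this also collapses to a constant
// depends on the compiler and the standard library in use.
int runtime_sum() {
    std::vector<int> v(kValues.begin(), kValues.end());
    int sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        sum += v[i];
    }
    assert(sum == kFoldedSum);  // both paths must produce the same value
    return sum;
}
```

The constexpr path is guaranteed to be evaluated at compile time. The std::vector path is only folded when the optimizer manages to see through the container, which is the part that differs between compilers here.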

New/Delete Optimization

Compilers conforming to C++14 are allowed to omit calls to new and delete under certain conditions, specifically when the number of allocation calls is not part of the observable behavior of the program (see the N3664 standard proposal). This has already generated much discussion on SO:

clang vs gcc - optimization including operator new
Is the compiler allowed to optimize out heap memory allocations?
Optimization of raw new[]/delete[] vs std::vector

Clang invoked with -std=c++14 -stdlib=libc++ indeed performs this optimization and eliminates the calls to new and delete, which do carry side effects, but supposedly do not affect the observable behaviour of the program. With -stdlib=libstdc++, Clang is stricter and keeps the calls to new and delete - although, by looking at the assembly, it's clear they are not really needed.
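To make the rule concrete, here is a minimal sketch with raw new[]/delete[] instead of std::vector. The function names are hypothetical. Whether a given compiler actually drops the allocation is not guaranteed, since the optimization is permitted but not required:

```cpp
#include <cassert>

// Hypothetical function: one new[]/delete[] pair around a trivial computation.
int sum_via_heap() {
    int* data = new int[4]{1, 2, 3, 4};  // allocation a compiler may elide
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        sum += data[i];
    }
    delete[] data;                        // matching deallocation
    return sum;
}

// What an optimizer is permitted to reduce the function above to.
int sum_without_heap() {
    return 10;
}

// A caller cannot tell the two versions apart by their results.
void check_same_result() {
    assert(sum_via_heap() == sum_without_heap());
}
```

The only difference between the two functions is the number of calls to the allocation functions, which is exactly what C++14 stops treating as observable behavior.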

Now, when inspecting the main code generated by VS, we find two function calls (with the rest of the vector construction and iteration code inlined into main):

call std::vector<int,std::allocator<int> >::_Range_construct_or_tidy<int const * __ptr64>

and

call void __cdecl operator delete(void * __ptr64)

The first is used for allocating the vector and the second for deallocating it; practically all the other functions in the VS output are pulled in by these function calls. This hints that Visual C++ will not optimize away calls to allocation functions (for C++14 conformance we should add the /std:c++14 flag, but the results are the same).

This blog post (May 10, 2017) from the Visual C++ team confirms that this optimization is indeed not implemented. Searching the page for N3664 shows that "Avoiding/fusing allocations" is at status N/A, and the linked comment says:

[E] Avoiding/fusing allocations is permitted but not required. For the time being, we’ve chosen not to implement this.

With new/delete optimization and constant propagation combined, it's easy to see the impact of these two optimizations in this Compiler Explorer 3-way comparison of Clang with -stdlib=libc++, Clang with -stdlib=libstdc++, and GCC.

STL Implementation

VS has its own STL implementation, which is structured very differently from libc++ and libstdc++, and that seems to contribute heavily to VS's inferior code generation. While the VS STL has some very useful features, such as checked iterators and iterator debugging hooks (_ITERATOR_DEBUG_LEVEL), it gives the general impression of being heavier and less efficient than libstdc++.

To isolate the impact of the vector STL implementation, an interesting experiment is to compile with Clang combined with the VS header files. Indeed, using Clang 5.0.0 with the Visual Studio 2015 headers results in the following code generation. Clearly, the STL implementation has a huge impact!

main:                                   # @main
.Lfunc_begin0:
.Lcfi0:
.seh_proc main
    .seh_handler __CxxFrameHandler3, @unwind, @except
# BB#0:                                 # %.lr.ph
    pushq   %rbp
.Lcfi1:
    .seh_pushreg 5
    pushq   %rsi
.Lcfi2:
    .seh_pushreg 6
    pushq   %rdi
.Lcfi3:
    .seh_pushreg 7
    pushq   %rbx
.Lcfi4:
    .seh_pushreg 3
    subq    $72, %rsp
.Lcfi5:
    .seh_stackalloc 72
    leaq    64(%rsp), %rbp
.Lcfi6:
    .seh_setframe 5, 64
.Lcfi7:
    .seh_endprologue
    movq    $-2, (%rbp)
    movl    $16, %ecx
    callq   "??2@YAPEAX_K@Z"
    movq    %rax, -24(%rbp)
    leaq    16(%rax), %rcx
    movq    %rcx, -8(%rbp)
    movups  .L.ref.tmp(%rip), %xmm0
    movups  %xmm0, (%rax)
    movq    %rcx, -16(%rbp)
    movl    4(%rax), %ebx
    movl    8(%rax), %esi
    movl    12(%rax), %edi
.Ltmp0:
    leaq    -24(%rbp), %rcx
    callq   "?_Tidy@?$vector@HV?$allocator@H@std@@@std@@IEAAXXZ"
.Ltmp1:
# BB#1:                                 # %"\01??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ.exit"
    addl    %ebx, %esi
    leal    1(%rdi,%rsi), %eax
    addq    $72, %rsp
    popq    %rbx
    popq    %rdi
    popq    %rsi
    popq    %rbp
    retq
    .seh_handlerdata
    .long   ($cppxdata$main)@IMGREL
    .text

Update - Visual Studio 2017

In Visual Studio 2017, <vector> has seen a major overhaul, as announced in this blog post from the Visual C++ team. Specifically, it mentions the following optimizations:

Eliminated unnecessary EH logic. For example, vector’s copy assignment operator had an unnecessary try-catch block. It just has to provide the basic guarantee, which we can achieve through proper action sequencing.

Improved performance by avoiding unnecessary rotate() calls. For example, emplace(where, val) was calling emplace_back() followed by rotate(). Now, vector calls rotate() in only one scenario (range insertion with input-only iterators, as previously described).

Improved performance with stateful allocators. For example, move construction with non-equal allocators now attempts to activate our memmove() optimization. (Previously, we used make_move_iterator(), which had the side effect of inhibiting the memmove() optimization.) Note that a further improvement is coming in VS 2017 Update 1, where move assignment will attempt to reuse the buffer in the non-POCMA non-equal case.
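To illustrate the rotate() remark, here is a sketch of the append-then-rotate pattern that the quote describes. It is written purely for illustration, with a hypothetical helper name, and is not Microsoft's actual <vector> source:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: insert by appending at the end, then rotating the new
// element back into position. Correct, but it moves every element after pos
// through a generic algorithm instead of shifting the tail once.
void insert_via_rotate(std::vector<int>& v, std::size_t pos, int value) {
    assert(pos <= v.size());
    v.push_back(value);
    std::rotate(v.begin() + static_cast<std::ptrdiff_t>(pos),
                v.end() - 1,
                v.end());
}
```

Calling insert_via_rotate(v, 2, 3) on a vector holding 1, 2, 4 leaves it holding 1, 2, 3, 4, the same result as v.insert(v.begin() + 2, 3).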

Curious, I went back to test this. When building the example in Visual Studio 2017, the result is still a multi-page assembly listing with many function calls, so even if code generation has improved, it is difficult to notice.

However, when building with clang 5.0.0 and Visual Studio 2017 headers, we get the following assembly:

main:                                   # @main
.Lcfi0:
.seh_proc main
# BB#0:
    subq    $40, %rsp
.Lcfi1:
    .seh_stackalloc 40
.Lcfi2:
    .seh_endprologue
    movl    $16, %ecx
    callq   "??2@YAPEAX_K@Z" # void * __ptr64 __cdecl operator new(unsigned __int64)
    movq    %rax, %rcx
    callq   "??3@YAXPEAX@Z" # void __cdecl operator delete(void * __ptr64)
    movl    $10, %eax
    addq    $40, %rsp
    retq
    .seh_handlerdata
    .text

Note the movl $10, %eax instruction - that is, with VS 2017's <vector>, clang was able to collapse everything, precalculate the result of 10, and keep only the calls to new and delete.

I'd say that is pretty amazing!

Function Inlining

Function inlining is probably the single most vital optimization in this example. By collapsing the code of called functions into their call sites, the compiler is able to perform further optimizations on the merged code. Removing function calls also reduces call overhead and eliminates optimization barriers.
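A small sketch of why inlining matters here. The two accessors are hypothetical stand-ins for vector::size() and operator[], not the real library functions:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical accessors standing in for vector::size() and operator[].
static std::size_t element_count() { return 4; }

static int element_at(const int* data, std::size_t i) {
    assert(i < element_count());  // debug-only bounds check, gone with NDEBUG
    return data[i];
}

int sum_with_helpers() {
    static const int data[4] = {1, 2, 3, 4};
    int sum = 0;
    // As opaque calls, the loop bound and each element are unknown here.
    // Once both helpers are inlined, the trip count and all loaded values
    // become visible, and the loop can be unrolled and folded.
    for (std::size_t i = 0; i < element_count(); ++i) {
        sum += element_at(data, i);
    }
    return sum;
}
```

If either helper stayed an out-of-line call, the optimizer would have to assume it could return anything, and the fold to a constant would no longer be possible.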

When inspecting the generated assembly for VS, and comparing the code before and after inlining (Compiler Explorer), we can see that most vector functions were indeed inlined, except for the allocation and deallocation functions. In particular, there are calls to memmove, which are the result of inlining some higher-level functions, such as _Uninitialized_copy_al_unchecked.

memmove is a library function and therefore cannot be inlined like ordinary code. However, Clang has a clever way around this: it replaces the call to memmove with a call to __builtin_memmove, a builtin/intrinsic function with the same functionality as memmove. As opposed to a plain function call, the compiler generates the code for it itself and embeds it into the calling function. Consequently, the code can be further optimized inside the calling function and eventually removed as dead code.
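A minimal sketch of the kind of copy this applies to. It uses the portable std::memmove rather than __builtin_memmove (which is specific to GCC and Clang); the function name is hypothetical, and whether the call is expanded inline depends on the compiler and flags:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical function: copy four ints with memmove, then sum the copy.
int sum_after_copy() {
    const int source[4] = {1, 2, 3, 4};
    int copy[4] = {};

    // The size is a compile-time constant and both buffers are local, so a
    // compiler that treats memmove as a builtin can expand the copy inline
    // and then optimize across it.
    std::memmove(copy, source, sizeof(source));
    assert(std::memcmp(copy, source, sizeof(source)) == 0);

    int sum = 0;
    for (int value : copy) {
        sum += value;
    }
    return sum;
}
```

When the copy is expanded inline, the values flowing into the loop are known constants again, so the same folding described under Constant Propagation can apply.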

Summary

To conclude, Clang is clearly superior to VS in this example, thanks both to high-quality optimizations and to a more efficient vector STL implementation. Even when using the same header files for Visual C++ and Clang (the Visual Studio 2017 headers), Clang beats Visual C++ hands down.

While writing this answer, I couldn't help but think: what would we do without Compiler Explorer? Thanks, Matt Godbolt, for this amazing tool!

Tags: c++, c++11, visual-c++, cl

Alexander
