Are there any options (other than /O2) to improve the Visual C++ code output? The MSDN documentation is quite bad in this regard. Note that I'm not asking about project-wide settings (link-time optimization, etc...). I'm only interested in this particular example.
The fairly simple C++11 code looks like this:
#include <vector>
int main() {
std::vector<int> v = {1, 2, 3, 4};
int sum = 0;
for(int i = 0; i < v.size(); i++) {
sum += v[i];
}
return sum;
}
Clang's output with libc++ is quite compact:
main: # @main
mov eax, 10
ret
Visual C++ output, on the other hand, is a multi-page mess. Am I missing something here or is VS really this bad?
Compiler explorer link: https://godbolt.org/g/GJYHjE
Unfortunately, it's difficult to greatly improve Visual C++ output in this case, even by using more aggressive optimization flags. Several factors contribute to VS's inefficiency, including the lack of certain compiler optimizations and the structure of Microsoft's implementation of <vector>.
The generated assembly shows that Clang does an outstanding job optimizing this code. Specifically, compared to VS, Clang performs very effective constant propagation, function inlining (and consequently dead code elimination), and new/delete optimization.
Constant Propagation
In the example, the vector is statically initialized:
std::vector<int> v = {1, 2, 3, 4};
Normally, the compiler will store the constants 1, 2, 3, 4 in data memory and, in the for loop, load one value at a time, starting from the low address at which 1 is stored, and add each value to the sum.
Here's the abbreviated VS code for doing this:
movdqa xmm0, XMMWORD PTR __xmm@00000004000000030000000200000001
...
movdqu XMMWORD PTR $T1[rsp], xmm0 ; Store integers 1, 2, 3, 4 in memory
...
$LL4@main:
add ebx, DWORD PTR [rdx] ; loop and sum the values
lea rdx, QWORD PTR [rdx+4]
inc r8d
movsxd rax, r8d
cmp rax, r9
jb SHORT $LL4@main
Clang, however, is clever enough to realize that the sum can be calculated in advance. My best guess is that it replaces the loads of the constants from memory with constant mov operations into registers (that is, it propagates the constants), and then folds them into the result of 10. This has the useful side effect of breaking dependencies: since the addresses are no longer loaded from, the compiler is free to remove everything else as dead code.
Clang seems to be unique in doing this: neither VS nor GCC was able to precalculate the result of the vector accumulation.
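To make the folding step concrete, here is a minimal sketch of the computation Clang effectively performs at compile time. It is not taken from the original example: sum_constants and folded_sum are hypothetical names, std::array stands in for the vector so that no allocation is involved, and a C++14 (or later) compiler is assumed because of the loop inside a constexpr function.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: sums the same four constants as the example, but in a
// constexpr function, so the compiler is required to be able to evaluate it
// at compile time when it is used in a constant expression.
constexpr int sum_constants() {
    constexpr std::array<int, 4> values = {1, 2, 3, 4};
    int sum = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        sum += values[i];
    }
    return sum;
}

// Checked by the compiler itself: the whole loop reduces to one constant.
static_assert(sum_constants() == 10, "the loop folds to a single constant");

// Hypothetical runtime entry point, analogous to main in the question.
int folded_sum() {
    return sum_constants();
}
```

With the result forced into a constant expression like this, an optimizing compiler would typically be expected to emit just a constant load for folded_sum, although the exact output still depends on the compiler and flags.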
New/Delete Optimization
Compilers conforming to C++14 are allowed to omit calls to new and delete under certain conditions, specifically when the number of allocation calls is not part of the observable behavior of the program (see the N3664 standard paper). This has already generated much discussion on SO.
Clang invoked with -std=c++14 -stdlib=libc++ indeed performs this optimization and eliminates the calls to new and delete, which do carry side effects, but supposedly do not affect the observable behaviour of the program. With -stdlib=libstdc++, Clang is stricter and keeps the calls to new and delete, although, by looking at the assembly, it's clear they are not really needed.
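As a reduced illustration of the kind of allocation this rule covers, consider the sketch below. It is an assumption-laden example rather than code from the question: boxed_value and check_boxed_value are hypothetical names, and whether a particular compiler actually removes the allocation is not guaranteed, since the standard only permits it.

```cpp
#include <cassert>

// Hypothetical function: the heap allocation never escapes and its only
// purpose is to hold a value briefly, so a C++14 compiler is allowed
// (but not required) to drop the new/delete pair and return 42 directly.
int boxed_value() {
    int* p = new int(42);
    int result = *p;
    delete p;
    return result;
}

// The returned value is the same whether or not the allocation is elided.
void check_boxed_value() {
    assert(boxed_value() == 42);
}
```

Comparing the assembly for boxed_value across compilers in Compiler Explorer is a quick way to see which ones take advantage of the permission.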
Now, when inspecting the main code generated by VS, we can find two function calls (with the rest of the vector construction and iteration code inlined into main):
call std::vector<int,std::allocator<int> >::_Range_construct_or_tidy<int const * __ptr64>
and
call void __cdecl operator delete(void * __ptr64)
The first is used for allocating the vector, and the second for deallocating it, and practically all other functions in the VS output are pulled in by these function calls. This hints that Visual C++ will not optimize away calls to allocation functions (for C++14 conformance we should add the /std:c++14 flag, but the results are the same).
This blog post (May 10, 2017) from the Visual C++ team confirms that this optimization is indeed not implemented. Searching the page for N3664 shows that "Avoiding/fusing allocations" is at status N/A, and the linked comment says:
[E] Avoiding/fusing allocations is permitted but not required. For the time being, we’ve chosen not to implement this.
Combining new/delete optimization and constant propagation, it's easy to see the impact of these two optimizations in this Compiler Explorer 3-way comparison of Clang with -stdlib=libc++, Clang with -stdlib=libstdc++, and GCC.
STL Implementation
VS has its own STL implementation, which is structured very differently from libc++ and libstdc++, and that seems to contribute heavily to VS's inferior code generation. While the VS STL has some very useful features, such as checked iterators and iterator debugging hooks (_ITERATOR_DEBUG_LEVEL), it gives the general impression of being heavier and less efficient than libstdc++.
To isolate the impact of the vector STL implementation, an interesting experiment is to compile with Clang while using the VS header files. Indeed, Clang 5.0.0 with Visual Studio 2015 headers produces the following code, which shows that the STL implementation has a huge impact:
main: # @main
.Lfunc_begin0:
.Lcfi0:
.seh_proc main
.seh_handler __CxxFrameHandler3, @unwind, @except
# BB#0: # %.lr.ph
pushq %rbp
.Lcfi1:
.seh_pushreg 5
pushq %rsi
.Lcfi2:
.seh_pushreg 6
pushq %rdi
.Lcfi3:
.seh_pushreg 7
pushq %rbx
.Lcfi4:
.seh_pushreg 3
subq $72, %rsp
.Lcfi5:
.seh_stackalloc 72
leaq 64(%rsp), %rbp
.Lcfi6:
.seh_setframe 5, 64
.Lcfi7:
.seh_endprologue
movq $-2, (%rbp)
movl $16, %ecx
callq "??2@YAPEAX_K@Z"
movq %rax, -24(%rbp)
leaq 16(%rax), %rcx
movq %rcx, -8(%rbp)
movups .L.ref.tmp(%rip), %xmm0
movups %xmm0, (%rax)
movq %rcx, -16(%rbp)
movl 4(%rax), %ebx
movl 8(%rax), %esi
movl 12(%rax), %edi
.Ltmp0:
leaq -24(%rbp), %rcx
callq "?_Tidy@?$vector@HV?$allocator@H@std@@@std@@IEAAXXZ"
.Ltmp1:
# BB#1: # %"\01??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ.exit"
addl %ebx, %esi
leal 1(%rdi,%rsi), %eax
addq $72, %rsp
popq %rbx
popq %rdi
popq %rsi
popq %rbp
retq
.seh_handlerdata
.long ($cppxdata$main)@IMGREL
.text
Update - Visual Studio 2017
In Visual Studio 2017, <vector> has seen a major overhaul, as announced in this blog post from the Visual C++ team. Specifically, it mentions the following optimizations:
Eliminated unnecessary EH logic. For example, vector’s copy assignment operator had an unnecessary try-catch block. It just has to provide the basic guarantee, which we can achieve through proper action sequencing.
Improved performance by avoiding unnecessary rotate() calls. For example, emplace(where, val) was calling emplace_back() followed by rotate(). Now, vector calls rotate() in only one scenario (range insertion with input-only iterators, as previously described).
Improved performance with stateful allocators. For example, move construction with non-equal allocators now attempts to activate our memmove() optimization. (Previously, we used make_move_iterator(), which had the side effect of inhibiting the memmove() optimization.) Note that a further improvement is coming in VS 2017 Update 1, where move assignment will attempt to reuse the buffer in the non-POCMA non-equal case.
Curious, I went back to test this. When building the example in Visual Studio 2017, the result is still a multi-page assembly listing with many function calls, so even if code generation has improved, it is difficult to notice.
However, when building with clang 5.0.0 and Visual Studio 2017 headers, we get the following assembly:
main: # @main
.Lcfi0:
.seh_proc main
# BB#0:
subq $40, %rsp
.Lcfi1:
.seh_stackalloc 40
.Lcfi2:
.seh_endprologue
movl $16, %ecx
callq "??2@YAPEAX_K@Z" ; void * __ptr64 __cdecl operator new(unsigned __int64)
movq %rax, %rcx
callq "??3@YAXPEAX@Z" ; void __cdecl operator delete(void * __ptr64)
movl $10, %eax
addq $40, %rsp
retq
.seh_handlerdata
.text
Note the movl $10, %eax instruction: with VS 2017's <vector>, clang was able to collapse everything, precalculate the result of 10, and keep only the calls to new and delete.
I'd say that is pretty amazing!
Function Inlining
Function inlining is probably the single most vital optimization in this example. By collapsing the code of called functions into their call sites, the compiler is able to perform further optimizations on the merged code. In addition, removing function calls reduces call overhead and removes optimization barriers.
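The effect described above can be sketched with a reduced example. The names element_at and sum_first_four are hypothetical and do not appear in the original code; the point is only that once the small accessor is inlined, the loop body sees plain constants and can be folded.

```cpp
#include <cassert>

// Hypothetical accessor, standing in for vector::operator[]. While it remains
// an opaque call, the optimizer cannot see which value it returns.
static int element_at(const int* data, int i) {
    return data[i];
}

// After element_at is inlined, every load refers to a known constant, so an
// optimizer can typically unroll the loop and fold the sum to 10.
int sum_first_four() {
    static const int data[4] = {1, 2, 3, 4};
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        sum += element_at(data, i);
    }
    return sum;
}

// Behaviour is identical with or without inlining; only the code size differs.
void check_sum_first_four() {
    assert(sum_first_four() == 10);
}
```

Whether a given compiler inlines and folds this at a given optimization level is its own decision, so the assembly is worth checking rather than assuming.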
When inspecting the generated assembly for VS, and comparing the code before and after inlining (Compiler Explorer), we can see that most vector functions were indeed inlined, except for the allocation and deallocation functions. In particular, there are calls to memmove, which are the result of inlining some higher-level functions, such as _Uninitialized_copy_al_unchecked.
memmove is a library function, and therefore cannot be inlined. However, clang has a clever way around this: it replaces the call to memmove with a call to __builtin_memmove. __builtin_memmove is a builtin/intrinsic function, which has the same functionality as memmove, but as opposed to the plain function call, the compiler generates code for it and embeds it into the calling function. Consequently, the code can be further optimized inside the calling function and eventually removed as dead code.
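A small sketch of the same idea follows. It is illustrative only: copy_and_sum and check_copy_and_sum are hypothetical names, __builtin_memmove is used only where GCC or Clang is detected, and other compilers fall back to std::memmove, which they may or may not expand inline.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical function: copies four known constants and sums the copy.
// With a fixed, small size the compiler can expand the copy inline instead of
// calling the library routine, and may then fold the whole function.
int copy_and_sum() {
    const int src[4] = {1, 2, 3, 4};
    int dst[4];
#if defined(__clang__) || defined(__GNUC__)
    // GCC/Clang builtin with the same semantics as memmove.
    __builtin_memmove(dst, src, sizeof(src));
#else
    // Portable fallback; inline expansion is up to the compiler.
    std::memmove(dst, src, sizeof(src));
#endif
    return dst[0] + dst[1] + dst[2] + dst[3];
}

// The result does not depend on which branch above was compiled.
void check_copy_and_sum() {
    assert(copy_and_sum() == 10);
}
```

In practice GCC and Clang usually treat a plain memmove call with a constant size the same way, so the explicit builtin here mainly serves to make the mechanism visible.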
Summary
To conclude, Clang is clearly superior to VS in this example, thanks both to high-quality optimizations and to a more efficient vector STL implementation. When using the same header files for Visual C++ and clang (the Visual Studio 2017 headers), Clang beats Visual C++ hands down.
While writing this answer, I couldn't help but think: what would we do without Compiler Explorer? Thanks to Matt Godbolt for this amazing tool!