How can I make sense of C++ profiling data on Windows, when a lot of code gets inlined by the compiler? I.e. I of course want to measure the code that actually gets run, so by definition I'm going to measure an optimized build of the code. But it seems like none of the tools I try actually manage to resolve inline functions.
I have tried both the sampling profiler in Visual Studio 2017 Professional and VTune 2018. I have tried enabling /Zo, but it does not seem to have any effect.
I have found the following resource which seems to indicate that only Visual Studio Ultimate or Premium support inline frame information - is this still true for Visual Studio 2017? https://social.msdn.microsoft.com/Forums/en-US/9df15363-5aae-4f0b-a5ad-dd9939917d4c/which-functions-arent-pgo-optimized-using-profile-data?forum=vsdebug
Here is a code example:
#include <cmath>
#include <random>
#include <iostream>

inline double burn()
{
    std::uniform_real_distribution<double> uniform(-1E5, 1E5);
    std::default_random_engine engine;
    double s = 0;
    for (int i = 0; i < 100000000; ++i) {
        s += uniform(engine);
    }
    return s;
}

int main()
{
    std::cout << "random sum: " << burn() << '\n';
    return 0;
}
Compile it with Visual Studio in Release mode, or on the command line with cl /O2 /Zi /Zo /EHsc main.cpp. Then try to profile it with the CPU Sampling Profiler in Visual Studio. You will at most see something like this:
VTune 2018 looks similar on Windows. On Linux, perf and VTune have no problem showing frames from inlined functions... Is this feature, which is in my opinion crucial for C++ tooling, really not part of the non-Premium/Ultimate Visual Studio toolchains? How do people on Windows deal with that? What is the point of /Zo then?
EDIT: I just tried to compile the minimal example above with clang, and it produces different but still unsatisfying results. I compiled clang 6.0.0 (trunk), built from LLVM rev 318844 and clang rev 318874. Then I compiled my code with clang++ -std=c++17 -O2 -g main.cpp -o main.exe and ran the resulting executable with the Sampling Profiler in Visual Studio again. The result is:
So now I see the burn function, but lost the source file information. Also, uniform_real_distribution is still not being shown anywhere.
EDIT 2: As suggested in the comments, I have now also tried clang-cl with the same arguments as cl above, i.e. clang-cl.exe /O2 /Zi /Zo /EHsc main.cpp. This produces the same results as clang.exe, but we also get somewhat working source mappings:
EDIT 3: I originally thought clang would magically solve this issue. It doesn't, sadly. Most inlined frames are still missing :(
EDIT 4: Inline frames are not supported in VTune for applications built with MSVC/PDB: https://software.intel.com/en-us/forums/intel-vtune-amplifier-xe/topic/749363
The decision whether to inline a function is made by the compiler, so yes, it is made at compile time. If you generate the assembly code with the -S option (gcc -S produces assembly code), you can see whether your function has been inlined or not.
GCC automatically inlines member functions defined within the class body of C++ programs even if they are not explicitly declared inline . (You can override this with -fno-default-inline ; see Options Controlling C++ Dialect.)
An inline function is one for which the compiler copies the code from the function definition directly into the code of the calling function rather than creating a separate set of instructions in memory. This eliminates call-linkage overhead and can expose significant optimization opportunities.
The inline keyword asks the compiler to substitute the code within the function definition for every instance of a function call. Using inline functions can make your program faster because they eliminate the overhead associated with function calls.
I have tried both the sampling profiler in Visual Studio 2017 Professional as well as VTune 2018. I have tried to enable /Zo, but it does not seem to have any effect.
I have found the following resource which seems to indicate that only Visual Studio Ultimate or Premium support inline frame information - is this still true for Visual Studio 2017?
Fortunately, I already have three different versions of VS installed, so I can tell you more about support for the inlined-function information feature discussed in the article you referenced:
There is no announcement on the VC++ blog regarding any improvements to the VS 2017 sampling profiler, so I don't think it is any better compared to the profiler of VS Community 2015.
Note that different versions of the compiler may make different optimization decisions. For example, I've observed that VS 2013 and 2015 don't inline the burn function.
By using VS Community 2015 Update 3, I get profiling results very similar to what is shown in the third picture and the same code is highlighted.
Now I will discuss how this additional information can be useful when interpreting the profiling results, how you can get it manually with some more effort, and how to interpret the results despite inlined functions.
How can I make sense of C++ profiling data on Windows, when a lot of code gets inlined by the compiler?
The VS profiler will only attribute costs to functions that were not inlined. For functions that were inlined, the costs will be added up and included in some caller function that was not inlined (in this case, the burn function).
By adding up the estimated execution time of the non-inlined functions called from burn (as shown in the picture), we get 31.3 + 22.7 + 4.7 + 1.1 = 59.8%. In addition, the estimated execution time of the Function Body as shown in the picture is 40.2%. Note that 59.8% + 40.2% = 100% of the time spent in burn, as it should be. In other words, 40.2% of the time spent in burn was spent in the body of the function and any functions that were inlined in it.
40.2% is a lot. The next logical question is, which functions get inlined in burn? By using that feature I discussed earlier (which is available in VS Community 2015), I can determine that the following functions were inlined in burn:
std::mersenne_twister_engine<unsigned int,32,624,397,31,2567483615,11,4294967295,7,2636928640,15,4022730752,18,1812433253>::{ctor};
std::mersenne_twister<unsigned int,32,624,397,31,2567483615,11,7,2636928640,15,4022730752,18>::{ctor};
std::mersenne_twister<unsigned int,32,624,397,31,2567483615,11,7,2636928640,15,4022730752,18>::seed;
std::uniform_real<double>::operator();
std::uniform_real<double>::_Eval;
std::generate_canonical;
Without that feature, you'll have to manually disassemble the emitted executable binary (either using the VS debugger or using dumpbin) and locate all the x86 call instructions. By comparing that with the functions called in the source code, you can determine which functions got inlined.
The capabilities of the VS sampling profiler up to and including VS 2017 end at this point. But it's really not a significant restriction. Typically, not many functions get inlined in the same function due to a hard upper limit imposed by the compiler on the size of each function. So it's generally possible to manually check the source code and/or the assembly code of each inlined function and see if that code would contribute significantly to the execution time. I did that and it's likely the case that the body of burn (excluding inlined functions) and these two inlined functions are mostly responsible for that 40.2%:
std::mersenne_twister<unsigned int,32,624,397,31,2567483615,11,7,2636928640,15,4022730752,18>::seed;
std::uniform_real<double>::_Eval;
Taking all of that into consideration, the only potential optimization opportunity I see here is to memoize the results of log2.
The VTune sampling profiler is certainly more powerful than the VS sampling profiler. In particular, VTune attributes costs to individual source code lines or assembly instructions. However, this attribution is highly approximate and often nonsensical, so I would be very careful when interpreting results visualized in that way. I'm not sure whether VTune supports the Enhance Optimized Debugging information or to what degree it supports attributing costs to inlined functions. The best place to ask these questions is the Intel VTune Amplifier community forum.