
Profiling inlined C++ functions with Visual Studio Compiler

How can I make sense of C++ profiling data on Windows, when a lot of code gets inlined by the compiler? I.e. I of course want to measure the code that actually gets run, so by definition I'm going to measure an optimized build of the code. But it seems like none of the tools I try actually manage to resolve inline functions.

I have tried both the sampling profiler in Visual Studio 2017 Professional and VTune 2018. I have tried enabling /Zo, but it does not seem to have any effect.

I have found the following resource which seems to indicate that only Visual Studio Ultimate or Premium support inline frame information - is this still true for Visual Studio 2017? https://social.msdn.microsoft.com/Forums/en-US/9df15363-5aae-4f0b-a5ad-dd9939917d4c/which-functions-arent-pgo-optimized-using-profile-data?forum=vsdebug

Here is some example code:

#include <cmath>
#include <random>
#include <iostream>

inline double burn()
{
    std::uniform_real_distribution<double> uniform(-1E5, 1E5);
    std::default_random_engine engine;
    double s = 0;
    for (int i = 0; i < 100000000; ++i) {
        s += uniform(engine);
    }
    return s;
}

int main()
{
    std::cout << "random sum: " << burn() << '\n';
    return 0;
}

Compile it with Visual Studio in Release mode. Or on the command line, try cl /O2 /Zi /Zo /EHsc main.cpp. Then try to profile it with the CPU Sampling Profiler in Visual Studio. You will at most see something like this:

confusing profile since inline frames are missing

VTune 2018 looks similar on Windows. On Linux, perf and VTune have no problem showing frames from inlined functions... Is this feature, which is in my opinion crucial for C++ tooling, really not part of the non-Premium/Ultimate Visual Studio toolchains? How do people on Windows deal with that? What is the point of /Zo then?

EDIT: I just tried compiling the minimal example above with clang, and it produces different but still unsatisfying results. I used clang 6.0.0 (trunk), built from LLVM rev 318844 and clang rev 318874. I compiled my code with clang++ -std=c++17 -O2 -g main.cpp -o main.exe and ran the resulting executable under the Sampling Profiler in Visual Studio again. The result is:

inline frames are shown in profile after compiling with clang

So now I see the burn function, but lost the source file information. Also, the uniform_real_distribution is still not being shown anywhere.

EDIT 2: As suggested in the comments, I now also tried out clang-cl with the same arguments as cl above, i.e.: clang-cl.exe /O2 /Zi /Zo /EHsc main.cpp. This produces the same results as clang.exe, but we also get somewhat working source mappings:

clang-cl shows inliners and somewhat functional source mapping

EDIT 3: I originally thought clang would magically solve this issue. It doesn't, sadly. Most inlined frames are still missing :(

EDIT 4: Inline frames are not supported in VTune for applications built with MSVC/PDB: https://software.intel.com/en-us/forums/intel-vtune-amplifier-xe/topic/749363

milianw asked Nov 28 '17 21:11


1 Answer

I have tried both the sampling profiler in Visual Studio 2017 Professional and VTune 2018. I have tried enabling /Zo, but it does not seem to have any effect.

I have found the following resource which seems to indicate that only Visual Studio Ultimate or Premium support inline frame information - is this still true for Visual Studio 2017?

Fortunately, I already have three different versions of VS installed. I can tell you more information on the support for the inlined functions information feature as discussed in the article you referenced:

  • VS Community 2013 Update 5 does not support showing inlined functions even when I specify /d2Zi+. It seems that it is only supported in VS 2013 Premium or Ultimate.
  • VS Community 2015 Update 3 does support showing inlined functions (the feature discussed in the article). By default, /Zi is specified. /Zo is enabled implicitly with /Zi, so you don't have to specify it explicitly. Therefore, you don't need VS 2015 Premium or Ultimate.
  • VS Community 2017 with the latest update does not support showing inlined functions irrespective of /Zi and /Zo. It seems that it is only supported in VS 2017 Professional and/or Enterprise.

There is no announcement on the VC++ blog regarding any improvements to the VS 2017 sampling profiler, so I don't think it is any better compared to the profiler of VS Community 2015.

Note that different versions of the compiler may make different optimization decisions. For example, I've observed that VS 2013 and 2015 don't inline the burn function.

By using VS Community 2015 Update 3, I get profiling results very similar to what is shown in the third picture and the same code is highlighted.

Now I will discuss how this additional information can be useful when interpreting the profiling results, how you can get it manually with some more effort, and how to interpret the results despite the inlined functions.

How can I make sense of C++ profiling data on Windows, when a lot of code gets inlined by the compiler?

The VS profiler will only attribute costs to functions that were not inlined. For functions that were inlined, the costs will be added up and included in some caller function that was not inlined (in this case, the burn function).
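Because costs are only attributed to functions that exist as real frames, one common diagnostic workaround is to temporarily forbid inlining of the functions you want to see. The following is a minimal sketch of that idea, not something taken from the question: the macro name PROFILE_NOINLINE and the helpers draw and burn_profiled are made up for illustration. Keep in mind that this changes the generated code (it adds call overhead), so the numbers describe the diagnostic build rather than the fully optimized one.

```cpp
#include <random>

// Hypothetical macro for this sketch: expands to the compiler's
// "never inline this function" annotation.
#if defined(_MSC_VER)
#  define PROFILE_NOINLINE __declspec(noinline)
#else
#  define PROFILE_NOINLINE __attribute__((noinline))
#endif

// Hypothetical helper: one draw from the distribution, kept out of line
// so that a sampling profiler has a separate frame to attribute cost to.
PROFILE_NOINLINE double draw(std::uniform_real_distribution<double>& dist,
                             std::default_random_engine& engine)
{
    return dist(engine);
}

// Variant of burn() from the question with a configurable iteration count.
PROFILE_NOINLINE double burn_profiled(int iterations)
{
    std::uniform_real_distribution<double> uniform(-1E5, 1E5);
    std::default_random_engine engine;
    double s = 0;
    for (int i = 0; i < iterations; ++i) {
        s += draw(uniform, engine);
    }
    return s;
}
```

Calling burn_profiled(100000000) from main reproduces the workload of the original example, with draw and burn_profiled now expected to appear as their own entries in the profile. Standard library internals such as generate_canonical can still be inlined into draw.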

By adding up the estimated execution time of the non-inlined called functions from burn (as shown in the picture), we get 31.3 + 22.7 + 4.7 + 1.1 = 59.8%. In addition, the estimated execution time of the Function Body as shown in the picture is 40.2%. Note that 59.8% + 40.2% = 100% of the time spent in burn, as it should be. In other words, 40.2% of the time spent in burn was spent in the body of the function and any functions that were inlined in it.
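The arithmetic above is the usual relation between inclusive and exclusive (self) time: self time is whatever remains after subtracting the callees that have their own frames. A small sketch of that calculation follows; the function name self_time_percent is made up for illustration.

```cpp
#include <numeric>
#include <vector>

// Given the percentages of a function's inclusive time that the profiler
// attributes to its non-inlined callees, return the remainder: the share
// spent in the function body plus everything that was inlined into it.
double self_time_percent(const std::vector<double>& callee_percents)
{
    const double callees =
        std::accumulate(callee_percents.begin(), callee_percents.end(), 0.0);
    return 100.0 - callees;
}
```

With the four callee values from the picture (31.3, 22.7, 4.7, 1.1), this yields roughly 40.2, up to floating-point rounding.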

40.2% is a lot. The next logical question is, which functions get inlined in burn? By using that feature I discussed earlier (which is available in VS Community 2015), I can determine that the following functions were inlined in burn:

std::mersenne_twister_engine<unsigned int,32,624,397,31,2567483615,11,4294967295,7,2636928640,15,4022730752,18,1812433253>::{ctor};
std::mersenne_twister<unsigned int,32,624,397,31,2567483615,11,7,2636928640,15,4022730752,18>::{ctor};
std::mersenne_twister<unsigned int,32,624,397,31,2567483615,11,7,2636928640,15,4022730752,18>::seed;
std::uniform_real<double>::operator();
std::uniform_real<double>::_Eval;
std::generate_canonical;

Without that feature, you'll have to manually disassemble the emitted executable binary (either using the VS debugger or using dumpbin) and locate all the x86 call instructions. By comparing that with the functions called in the source code, you can determine which functions got inlined.
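The manual comparison can be partly automated. The sketch below assumes you have saved a disassembly listing to a text file (for example the output of dumpbin /disasm) and that each call instruction appears on one line as the mnemonic call followed by its operand. The exact listing format is an assumption here, and the function name call_targets is made up for illustration.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Collect the distinct operands of all "call" instructions in a textual
// disassembly listing. Any function that is called in the source code but
// missing from this set is a candidate for having been inlined.
std::set<std::string> call_targets(std::istream& listing)
{
    std::set<std::string> targets;
    std::string line;
    while (std::getline(listing, line)) {
        std::istringstream tokens(line);
        std::string token;
        while (tokens >> token) {
            if (token == "call") {
                // Everything after the mnemonic is treated as the operand.
                std::string target;
                std::getline(tokens >> std::ws, target);
                if (!target.empty()) {
                    targets.insert(target);
                }
                break;
            }
        }
    }
    return targets;
}
```

Pass it a std::ifstream opened on the saved listing. Indirect calls (through a register or a pointer) show up with non-symbolic operands and still need to be checked by hand.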

The capabilities of the VS sampling profiler up to and including VS 2017 end at this point. But it's really not a significant restriction. Typically, not many functions get inlined in the same function due to a hard upper limit imposed by the compiler on the size of each function. So it's generally possible to manually check the source code and/or the assembly code of each inlined function and see if that code would contribute significantly to the execution time. I did that and it's likely the case that the body of burn (excluding inlined functions) and these two inlined functions are mostly responsible for that 40.2%.

std::mersenne_twister<unsigned int,32,624,397,31,2567483615,11,7,2636928640,15,4022730752,18>::seed;
std::uniform_real<double>::_Eval;

Taking all of that into consideration, the only potential optimization opportunity I see here is to memoize the results of log2.

The VTune sampling profiler is certainly more powerful than the VS sampling profiler. In particular, VTune attributes costs to individual source code lines or assembly instructions. However, this attribution is highly approximate and often nonsensical, so I would be very careful when interpreting results visualized in that way. I'm not sure whether VTune supports the Enhanced Optimized Debugging information (/Zo) or to what degree it supports attributing costs to inlined functions. The best place to ask these questions is the Intel VTune Amplifier community forum.

Hadi Brais answered Oct 07 '22 17:10