Does profile-guided optimization done by compiler notably hurt cases not covered with profiling dataset?

This question is not specific to C++; AFAIK certain runtimes like the Java RE can do profile-guided optimization on the fly, and I'm interested in that too.

MSDN describes PGO like this:

  1. I instrument my program and run it under the profiler, then
  2. the compiler uses the data gathered by the profiler to automatically reorganize branching and loops in such a way that branch misprediction is reduced and the most often run code is placed compactly to improve its locality

Now obviously the profiling results will depend on the dataset used.

With normal manual profiling and optimization I'd find some bottlenecks, improve those bottlenecks, and likely leave all the other code untouched. PGO seems to improve often-run code at the expense of making rarely-run code slower.

Now what if that slowed-down code is run often on another dataset that the program will see in the real world? Will the program's performance degrade compared to a program compiled without PGO, and how bad is the degradation likely to be? In other words, does PGO really improve my code's performance for the profiling dataset and possibly worsen it for other datasets? Are there any real examples with real data?

asked Oct 20 '11 by sharptooth

2 Answers

Disclaimer: I have not done more with PGO than read up on it and tried it once with a sample project for fun. A lot of the following is based on my experience with the "non-PGO" optimizations and educated guesses. TL;DR below.

This page lists the optimizations done by PGO. Let's look at them one by one (grouped by impact):

Inlining: For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.

Register Allocation: Optimizing with profile data results in better register allocation.

Virtual Call Speculation: If a virtual call, or other call through a function pointer, frequently targets a certain function, a profile-guided optimization can insert a conditionally-executed direct call to the frequently-targeted function, and the direct call can be inlined.

These apparently improve the prediction of whether or not certain optimizations pay off. There is no direct tradeoff for non-profiled code paths.


Basic Block Optimization: Basic block optimization allows commonly executed basic blocks that temporally execute within a given frame to be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.

Function Layout: Based on the call graph and profiled caller/callee behavior, functions that tend to be along the same execution path are placed in the same section.

Dead Code Separation: Code that is not called during profiling is moved to a special section that is appended to the end of the set of sections. This effectively keeps this section out of the often-used pages.

EH Code Separation: The EH code, being exceptionally executed, can often be moved to a separate section when profile-guided optimizations can determine that the exceptions occur only on exceptional conditions.

All of this may reduce the locality of non-profiled code paths. In my experience, the impact is noticeable or severe if such a code path has a tight loop that exceeds the L1 code cache (and maybe even thrashes L2). That sounds exactly like a path that should have been included in a PGO profile :)

Dead Code separation can have a huge impact - both ways - because it can reduce disk access.

If you rely on exceptions being fast, you are doing it wrong.


Size/Speed Optimization: Functions where the program spends a lot of time can be optimized for speed.

The rule of thumb nowadays is to "optimize for size by default, and only optimize for speed where needed (and verify it helps)". The reason is again the code cache - in most cases, smaller code is also faster code because of the code cache. So this kind of automates what you should do manually. Compared to a global speed optimization, this would slow down non-profiled code paths only in very atypical cases ("weird code" or a target machine with unusual cache behavior).


Conditional Branch Optimization: With the value probes, profile-guided optimizations can find if a given value in a switch statement is used more often than other values. This value can then be pulled out of the switch statement. The same can be done with if/else instructions where the optimizer can order the if/else so that either the if or else block is placed first depending on which block is more frequently true.

I would file that under "improved prediction", too, unless you feed the wrong PGO information.

The typical cases where this can pay off a lot are run-time parameter/range validation and similar paths that should never be taken in a normal execution.

The breaking case would be:

if (x > 0) DoThis(); else DoThat();

in a relevant tight loop and profiling only the x > 0 case.


Memory Intrinsics: The expansion of intrinsics can be decided better if it can be determined if an intrinsic is called frequently. An intrinsic can also be optimized based on the block size of moves or copies.

Again, mostly better information, with a small possibility of penalizing untested data.

Example: this is all an "educated guess", but I think it's quite illustrative for the entire topic.

Assume you have a memmove that is always called on well-aligned, non-overlapping buffers with a length of 16 bytes.

A possible optimization is to verify these conditions and use inlined MOV instructions for this case, calling a general memmove (handling alignment, overlap and odd lengths) only when the conditions are not met.

The benefits can be significant in a tight loop that copies structs around, as you improve locality and reduce the expected-path instruction count, likely with more chances for pairing/reordering.

The penalty is comparatively small, though: in the general case without PGO, you would either always call the full memmove or inline the full memmove implementation. The optimization adds a few instructions (including a conditional jump) to something rather complex; I'd assume a 10% overhead at most. In most cases, these 10% will be below the noise due to cache access.

However, there is a very slight chance of significant impact if the unexpected branch is taken frequently and the additional instructions for the expected case, together with the instructions for the default case, push a tight loop out of the L1 code cache.

Note that you are already at the limits of what the compiler could do for you. The additional instructions can be expected to be a few bytes, compared to a few K in code cache. A static optimizer could hit the same fate depending on how well it can hoist invariants - and how much you let it.


Conclusion:

  • Many of the optimizations are neutral.
  • Some optimizations can have slight negative impact on non-profiled code paths
  • The impact is usually much smaller than the possible gains
  • Very rarely, a small impact can be emphasized by other contributing pathological factors
  • A few optimizations (namely, the layout of code sections) can have a large impact, but again the possible gains significantly outweigh that

My gut feel would further claim that

  • A static optimizer, on the whole, would be at least equally likely to create a pathological case
  • It would be pretty hard to actually destroy performance even with bad PGO input.

At that level, I would be much more afraid of PGO implementation bugs/shortcomings than of failed PGO optimizations.

answered Sep 30 '22 by peterchen

PGO can most certainly affect the run time of the code that is run less frequently. After all, you are modifying the locality of some functions/blocks, and that makes the blocks that are now placed together more cache-friendly.

What I have seen is that teams identify their high-priority scenarios. Then they run those to train the optimization profiler and measure the improvement. You don't want to run all the scenarios under PGO, because if you do, you might as well not run any.
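For reference, the train-then-rebuild cycle looks like this with GCC (a sketch; file names and scenario flags are placeholders, only the `-fprofile-generate`/`-fprofile-use` flags are real):

```shell
# 1. Instrumented build: the binary records execution counts at run time.
g++ -O2 -fprofile-generate app.cpp -o app

# 2. Training: run only the high-priority scenarios (writes .gcda files).
./app --scenario=checkout
./app --scenario=search

# 3. Optimized rebuild using the collected profile data.
g++ -O2 -fprofile-use app.cpp -o app
```

Whatever scenarios you run in step 2 are what the compiler will treat as "hot", which is why scenario selection matters so much.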

As with everything related to performance, you need to measure before you apply it. Measure your most common scenarios to see if they improved at all by using PGO training, and also measure the less common scenarios to see if they regressed at all.

answered Sep 30 '22 by krolth