Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What can a compiler do with branching information?

On a modern Pentium it is no longer possible to give branching hints to the processor it seems. Assuming that a profiling compiler such as gcc with profile-guided optimization gains information about likely branching behavior, what can it do to produce code that will execute more quickly?

The only option I know of is to move unlikely branches to the end of a function. Is there anything else?

Update.

http://download.intel.com/products/processor/manual/325462.pdf volume 2a, section 2.1.1 says

"Branch hint prefixes (2EH, 3EH) allow a program to give a hint to the processor about the most likely code path for a branch. Use these prefixes only with conditional branch instructions (Jcc). Other use of branch hint prefixes and/or other undefined opcodes with Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable behavior."

I don't know if these actually have any effect however.

On the other hand section 3.4.1. of http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf says

" Compilers generate code that improves the efficiency of branch prediction in Intel processors. The Intel C++ Compiler accomplishes this by:

  • keeping code and data on separate pages
  • using conditional move instructions to eliminate branches
  • generating code consistent with the static branch prediction algorithm
  • inlining where appropriate
  • unrolling if the number of iterations is predictable

With profile-guided optimization, the compiler can lay out basic blocks to eliminate branches for the most frequently executed paths of a function or at least improve their predictability. Branch prediction need not be a concern at the source level. For more information, see Intel C++ Compiler documentation. "

http://cache-www.intel.com/cd/00/00/40/60/406096_406096.pdf says in "Performance Improvements with PGO "

" PGO works best for code with many frequently executed branches that are difficult to predict at compile time. An example is the code with intensive error-checking in which the error conditions are false most of the time. The infrequently executed (cold) errorhandling code can be relocated so the branch is rarely predicted incorrectly. Minimizing cold code interleaved into the frequently executed (hot) code improves instruction cache behavior."

like image 599
marshall Avatar asked May 30 '13 11:05

marshall


2 Answers

If it's clear that a loop is rarely entered, or that it normally iterates very few times, then the compiler might avoid unrolling the loop, as doing so can add a lot of harmful complexity to handle edge conditions (an odd-number iterations, etc.). Vectorisation, in particular, should be avoided in such cases.

The compiler might rearrange nested tests, so that the one that most frequently results in a short-cut can be used to avoid performing a test on something with a 50% pass rate.

Register allocation can be optimised to avoid having a rarely-used block force register spill in the common case.

These are just some examples. I'm sure there are others I haven't thought of.

like image 44
sh1 Avatar answered Sep 19 '22 15:09

sh1


There are two possible sources for the information you want:

  1. There's Intel 64 and IA-32 Architectures Software Developer's Manual (3 volumes). This is a huge work which has evolved for decades. It's the best reference I know on a lot of subjects, including floating-point. In this case, you want to check volume 2, the instruction set reference.
  2. There's Intel 64 and IA-32 Architectures Optmization Reference Manual. This will tell you in somewhat brief terms what to expect from each microarchitecture.

Now, I don't know what you mean by a "modern Pentium" processor, this is 2013, right? There aren't any Pentiums anymore...

The instruction set does support telling the processor if the branch is expected to be taken or not taken by a prefix to the conditional branch instructions (such as JC, JZ, etc). See volume 2A of (1), section 2.1.1 (of the version I have) Instruction Prefixes. There is the 2E and 3E prefixes for not taken and taken respectively.

As to whether these prefixes actually have any effect, if we can get that information, it will be on Optimization Reference Manual, the section for the microarchitecture you want (and I'm sure it won't be the Pentium).

Apart from using those, there is an entire section on the Optimization Reference Manual on that subject, that's section 3.4.1 (of the version I have).

It makes no sense to reproduce that here, since you can download the manual for free. Briefly:

  • Eliminate branches by using conditional instructions (CMOV, SETcc),
  • Consider the static prediction algorithm (3.4.1.3),
  • Inlining
  • Loop unrolling

Also, some compilers, GCC, for instance, even when CMOV is not possible, often perform bitwise arithmetic to select one of two distinct things computed, thus avoiding branches. It does this particularly with SSE instructions when vectorizing loops.

Basically, the static conditions are:

  • Unconditional branches are predicted to be taken (... kind of expectable...)
  • Indirect branches are predicted not to be taken (because of a data dependency)
  • Backward conditionals are predicted to be taken (good for loops)
  • Forward conditionals are predicted not to be taken

You probably want to read the entire section 3.4.1.

like image 92
migle Avatar answered Sep 21 '22 15:09

migle