Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How has CPU architecture evolution affected virtual function call performance?

Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz.

It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor viz instruction reordering, cache preloading, dependency interleaving, etc.

The downside was that any deviation for the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function through a pointer (!) and this was apparently normal.

This is not a negligible "don't worry about it" number! Bear in mind that "good design" normally means "factor your functions as much as possible" and "encode semantics in the data types" which often implies virtual interfaces.

The trade-off is that code which doesn't perform such operations might get more than two instructions per cycle. These are numbers one wants to worry about when writing high-performance C++ code which is heavy on the object design and light on the number crunching.

I understand that the long-CPU-pipeline trend has been reversing as we enter the low-power era. Here's my question:

Does the latest generation of x86-compatible processors still suffer massive penalties for virtual function calls, bad branch predictions, etc?

like image 732
spraff Avatar asked Aug 30 '11 10:08

spraff


1 Answers

AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function

Huh.. so large..

There is an "Indirect branch prediction" method, which helps to predict virtual function jump, IF there was the same indirect jump some time ago. There is still a penalty for first and mispredicted virt. function jump.

Support varies from simple "predicted right if and only if the previous indirect branch was exactly the same" to very complex two-level tens or hundreds entries with detecting of periodic alternation of 2-3 target address for single indirect jmp instruction.

There was a lot of evolution here...

http://arstechnica.com/hardware/news/2006/04/core.ars/7

first introduced with the Pentium M: ... indirect branch predictor.

The indirect branch predictor

Because indirect branches load their branch targets from a register, instead of having them immediately available as is the case with direct branches, they're notoriously difficult to predict. Core's indirect branch predictor is a table that stores history information about the preferred target addresses of each indirect branch that the front end encounters. Thus when the front-end encounters an indirect branch and predicts it as taken, it can ask the indirect branch predictor to direct it to the address in the BTB that the branch will probably want.

http://www.realworldtech.com/page.cfm?ArticleID=rwt051607033728&p=3

Indirect branch prediction was first introduced with Intel’s Prescott microarchitecture and later the Pentium M.

between 16-50% of all branch mispredicts were indirect (29% on average). The real value of indirect branch misprediction is for many of the newer scripting or high level languages, such as Ruby, Perl or Python, which use interpreters. Other common indirect branch common culprits include virtual functions (used in C++) and calls to function pointers.

http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5

AMD has adopted some of these refinements; for instance adding indirect branch predictor arrays in Barcelona and later processors. However, the K8 has older and less accurate branch predictors than the Core 2.

http://www.agner.org/optimize/microarchitecture.pdf

3.12 Indirect jumps on older processors Indirect jumps, indirect calls, and returns may go to a different address each time. The prediction method for an indirect jump or indirect call is, in processors older than PM and K10, simply to predict that it will go to the same target as last time it was executed.

and the same pdf, page 14

Indirect jump prediction An indirect jump or call is a control transfer instruction that has more than two possible targets. A C++ program can generate an indirect jump or call with... a virtual function. An indirect jump or call is generated in assembly by specifying a register or a memory variable or an indexed array as the destination of a jump or call instruction. Many processors make only one BTB entry for an indirect jump or call. This means that it will always be predicted to go to the same target as it did last time. As object oriented programming with polymorphous classes has become more common, there is a growing need for predicting indirect calls with multiple targets. This can be done by assigning a new BTB entry for every new jump target that is encountered. The history buffer and pattern history table must have space for more than one bit of information for each jump incident in order to distinguish more than two possible targets. The PM is the first x86 processor to implement this method. The prediction rule on p. 12 still applies with the modification that the theoretical maximum period that can be predicted perfectly is mn, where m is the number of different targets per indirect jump, because there are mn different possible n-length subsequences. However, this theoretical maximum cannot be reached if it exceeds the size of the BTB or the pattern history table.

Agner's manual has a longer description of branch predictor in many modern CPUs and the evolution of predictor in cpus of every manufacturer (x86/x86_64).

Also a lot of theoretical "indirect branch prediction" methods (look in the Google scholar); even wiki said some words about it http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_indirect_jumps /

For Atoms from the agner's micro:

Prediction of indirect branches The Atom has no pattern predictor for indirect branches according to my tests. Indirect branches are predicted to go to the same target as last time.

So, for low power, indirect branch prediction is not so advanced. So does Via Nano:

Indirect jumps are predicted to go to the same target as last time.

I think, that shorter pipeline of lowpower x86 has lower penalty, 7-20 ticks.

like image 193
osgx Avatar answered Sep 28 '22 09:09

osgx