 

repz ret: why all the hassle?

The issue of repz ret has been covered here [1], as well as in other sources [2, 3], quite satisfactorily. However, in none of these sources did I find answers to the following:

  1. What is the actual penalty in a quantitative comparison with ret or nop; ret? Especially in the latter case – is decoding one extra instruction (and an empty one at that!) really relevant, when most functions either have 100+ of those or get inlined?

  2. Why was this never fixed in the AMD K8, and why did it even make its way into the K10? Since when is documenting an ugly workaround, based on behaviour that is and remains undocumented, preferred to actually fixing the issue when every detail of the cause is known?

asked Mar 11 '23 by The Vee


1 Answer

Branch misprediction
The reason for all the hoopla is the cost of branch mispredictions.
When the CPU encounters a branch, it predicts whether the branch will be taken and loads the predicted instructions into the pipeline.
If the prediction is wrong, the pipeline has to be flushed and the correct instructions loaded.
This can take up to number_of_stages_in_pipeline cycles, plus any cycles needed to load the data from the cache; 14 to 25 cycles per misprediction is typical.
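As a minimal illustration (x86-64, GNU assembler syntax; the function is invented for this answer), consider a branch whose direction depends on data. If bit 0 of the argument is effectively random, the predictor guesses wrong roughly half the time, and every wrong guess costs one of those 14-25 cycle flushes:

        .text
        .globl  is_odd
is_odd:                             # int is_odd(unsigned x) -- illustration only
        testb   $1, %dil            # data-dependent condition
        jnz     .Lodd               # hard to predict if bit 0 is random
        xorl    %eax, %eax          # even: return 0
        ret
.Lodd:
        movl    $1, %eax            # odd: return 1
        ret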

Reason: processor design
The reason K8 and K10 suffer from this is a nifty optimization by AMD.
AMD K8 and K10 pre-decode instructions and keep track of their boundaries and lengths directly in the CPU's L1 instruction cache.
To do this, the cache stores extra bits.

For every 128 bits (16 bytes) of instructions there are 76 bits of additional data stored.

The following table details this:

Data              Size       Notes
--------------------------------------------------------------------------
Instructions      128 bits   The instruction bytes as read from memory
Parity bits         8 bits   One parity bit for every 16 bits
Pre-decode         52 bits   3 bits per byte (start, end, function)
                             + 4 bits per 16-byte line
Branch selectors   16 bits   2 bits for every 2 bytes of instruction code

Total             204 bits   128 bits of instructions, 76 bits of metadata

Because all this data is stored in the L1 instruction cache, the K8/K10 CPU has to do far less work on decoding and branch prediction when the code is fetched again. This saves on silicon.
And because AMD does not have as big a transistor budget as Intel, it needs to work smarter.

However, if the code is especially tight, a jump and a RET can end up in the same two-byte branch-selector slot, so the RET inherits the jump's prediction information and gets predicted as NOT taken (because the jump next to it is).
By making the RET occupy two bytes (REP RET) this sharing can never occur, and the RET is always predicted correctly; see the sketch below.
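A minimal sketch of both layouts (x86-64, GNU assembler syntax; the functions are invented here and assume their argument is greater than zero). In the first version, the two-byte conditional jump is immediately followed by the one-byte RET (opcode C3), so on K8/K10 the two can fall into the same two-byte branch-selector slot. The second version uses the documented workaround: REP RET assembles to two bytes (F3 C3) and executes exactly like RET, so it always gets its own slot:

        .text
        .globl  spin_down
spin_down:                          # counts %edi down to zero
1:
        subl    $1, %edi
        jnz     1b                  # 2-byte conditional jump ...
        ret                         # ... directly followed by a 1-byte ret

        .globl  spin_down_k8
spin_down_k8:                       # same loop with the documented workaround
2:
        subl    $1, %edi
        jnz     2b
        rep     ret                 # 2 bytes (F3 C3), behaves exactly like ret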

Intel does not have this problem, but it suffers (or used to suffer) from a limited number of prediction slots, which AMD does not.

nop ret
There is never a reason to use nop ret. It is two instructions, so you waste an extra cycle executing the nop, and the single-byte RET can still share a branch-selector slot with a jump.
If you want the padding, use REP RET instead, or use a multi-byte NOP for alignment (see the encodings below).
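For comparison, a sketch of the byte encodings of the three endings mentioned above (x86-64, GNU assembler syntax; the snippet is purely illustrative):

        # nop + ret: two instructions; the 1-byte ret can still share a slot
        nop                         # 90
        ret                         # C3

        # rep ret: one 2-byte instruction that executes exactly like ret
        rep     ret                 # F3 C3

        # multi-byte NOPs: single instructions useful for alignment padding
        nopl    (%rax)              # 0F 1F 00
        nopw    0(%rax,%rax,1)      # 66 0F 1F 44 00 00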

Closing remarks
Only the local branch-prediction information is stored alongside the instructions in the cache.
There is also a separate global branch-prediction table.

answered Mar 20 '23 by Johan