The issue of the repz ret has been covered here [1] as well as in other sources [2, 3] quite satisfactorily. However, in none of these sources did I find answers to the following:

What is the actual penalty, quantitatively, compared with a plain ret or a nop; ret? Especially in the latter case: is decoding one extra instruction (and an empty one at that!) really relevant, when most functions either contain 100+ instructions or get inlined?

Why was this never fixed in the AMD K8, and why did it even make its way into the K10? Since when is documenting an ugly workaround, based on behaviour that is and remains undocumented, preferable to actually fixing the issue, when every detail of the cause is known?
Branch misprediction
The reason for all the hoopla is the cost of branch mispredictions.
When the CPU encounters a branch it predicts whether the branch will be taken and preloads the predicted instructions into the pipeline.
If the prediction turns out to be wrong, the pipeline has to be flushed and the correct instructions fetched.
This can take up to number_of_stages_in_pipeline cycles, plus any cycles needed to load the data from the cache; 14 to 25 cycles per misprediction is typical.
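To put a rough number on this yourself, you can time the same branch once with predictable data and once with unpredictable data; the per-iteration difference is approximately the misprediction cost. Below is a minimal sketch of such a microbenchmark (plain C on Linux/glibc; the sizes and names are my own choices, and the volatile accumulator is there to discourage the compiler from turning the branch into a branchless cmov):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)

    /* Sum the elements that pass the branch and return the elapsed time in
       nanoseconds.  With sorted data the branch is almost always predicted
       correctly; with random data it mispredicts roughly half the time. */
    static long long run(const unsigned char *data)
    {
        struct timespec t0, t1;
        volatile long long sum = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i++) {
            if (data[i] >= 128)          /* the branch under test */
                sum += data[i];
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) * 1000000000LL
             + (t1.tv_nsec - t0.tv_nsec);
    }

    static int cmp(const void *a, const void *b)
    {
        return *(const unsigned char *)a - *(const unsigned char *)b;
    }

    int main(void)
    {
        unsigned char *data = malloc(N);
        for (size_t i = 0; i < N; i++)
            data[i] = rand() & 0xFF;

        run(data);                        /* warm-up: touch all the pages */
        long long random_ns = run(data);  /* ~50% mispredicted            */
        qsort(data, N, 1, cmp);
        long long sorted_ns = run(data);  /* nearly 0% mispredicted       */

        printf("unpredictable: %lld ns, predictable: %lld ns\n",
               random_ns, sorted_ns);
        printf("extra cost per branch: ~%.2f ns\n",
               (double)(random_ns - sorted_ns) / N);
        free(data);
        return 0;
    }

Since only about half of the branches in the unpredictable run actually mispredict, the cost per misprediction is roughly twice the printed per-branch figure; multiply by the clock rate to convert nanoseconds to cycles.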
Reason: processor design
The reason the K8 and K10 suffer from this is a nifty optimization by AMD.
The K8 and K10 pre-decode instructions and keep track of their lengths in the CPU's L1 instruction cache.
To do this, the cache stores extra bits alongside the instruction bytes: for every 128 bits (16 bytes) of instructions there are 76 bits of additional data.
The following table details this:
Data              Size      Notes
-------------------------------------------------------------------------
Instructions      128 bits  The instruction bytes as read from memory
Parity bits         8 bits  One parity bit for every 16 bits
Pre-decode         52 bits  3 bits per byte (start, end, function)
                            + 4 bits per 16-byte line
Branch selectors   16 bits  2 bits for every 2 bytes of instruction code
-------------------------------------------------------------------------
Total             204 bits  128 bits of instructions + 76 bits of metadata
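To put the table in perspective, the metadata adds up quickly across the whole cache. A small back-of-the-envelope calculation (a sketch; the per-line numbers are taken from the table above, and the 64 KB figure is the K8's L1 instruction cache size):

    #include <stdio.h>

    int main(void)
    {
        /* Per 16-byte line of instructions, from the table above. */
        const int instr_bits     = 128;          /* the instruction bytes    */
        const int parity_bits    = 8;            /* 1 parity bit per 16 bits */
        const int predecode_bits = 3 * 16 + 4;   /* 3 per byte + 4 per line  */
        const int selector_bits  = 16;           /* 2 bits per 2 bytes       */

        const int meta_bits = parity_bits + predecode_bits + selector_bits;

        /* The K8's L1 instruction cache is 64 KB: 4096 lines of 16 bytes. */
        const int lines = 64 * 1024 / 16;

        printf("metadata per line : %d bits\n", meta_bits);               /* 76  */
        printf("total per line    : %d bits\n", instr_bits + meta_bits);  /* 204 */
        printf("metadata overall  : %d KB\n",
               meta_bits * lines / 8 / 1024);                             /* ~38 */
        return 0;
    }

In other words, the pre-decoded 64 KB instruction cache effectively holds about 102 KB of bits.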
Because all this data is stored in the L1 instruction cache, the K8/K10 has to do far less work at decode and branch-prediction time, which saves silicon.
And because AMD does not have as big a transistor budget as Intel, it needs to work smarter.
However, if the code is especially tight, a jump and a RET might occupy the same two-byte branch-selector slot, meaning that the RET gets predicted as NOT taken (because the jump following it is).
By making the RET occupy two bytes (REP RET), this can never occur and the RET will always be predicted correctly.
Intel does not have this problem, but it used to suffer from a limited number of prediction slots, which AMD does not.
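To make the workaround concrete, here is a minimal sketch of a hand-written leaf function that returns via REP RET (GCC file-scope assembly, x86-64 System V ABI on Linux; the function name add_one is my own). The REP prefix in front of RET is simply ignored, so the function behaves identically; the only difference is that the return instruction is two bytes (F3 C3) instead of one (C3), so it can no longer fall into the same slot as a neighbouring jump:

    #include <stdio.h>

    /* add_one: a trivial leaf function that ends in REP RET (bytes F3 C3)
       instead of a plain RET (byte C3).  Functionally identical; the prefix
       only changes the length of the return instruction. */
    __asm__(
        ".text\n"
        ".globl add_one\n"
        ".type  add_one, @function\n"
        "add_one:\n"
        "    lea  1(%rdi), %rax\n"
        "    rep  ret\n"
        ".size add_one, .-add_one\n"
    );

    long add_one(long x);   /* C prototype for the assembly function */

    int main(void)
    {
        printf("%ld\n", add_one(41));   /* prints 42 */
        return 0;
    }

This is also what compilers did automatically: for years GCC emitted rep ret in function epilogues when tuning for these AMD CPUs.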
nop ret
There is never a reason to use nop; ret.
It is two instructions, wasting an extra cycle to execute the nop, and the ret might still 'pair' with a jump in the same slot.
If you want to align, use a REP RET instead, or use a multi-byte nop.
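If alignment really is the goal, the assembler can do the padding with multi-byte NOPs for you. A minimal sketch in the same style as above (function names my own): the .p2align directive fills the gap before the next function with long-form NOPs (for example nopw 0(%rax,%rax,1)) rather than a run of one-byte NOPs, and the functions themselves still end in REP RET rather than a nop; ret pair:

    #include <stdio.h>

    /* Two back-to-back leaf functions.  The .p2align between them makes the
       assembler pad with multi-byte NOPs so that 'second' starts on a
       16-byte boundary. */
    __asm__(
        ".text\n"
        ".globl first\n"
        "first:\n"
        "    lea  1(%rdi), %rax\n"
        "    rep  ret\n"
        ".p2align 4\n"       /* pad to a 16-byte boundary with multi-byte NOPs */
        ".globl second\n"
        "second:\n"
        "    lea  2(%rdi), %rax\n"
        "    rep  ret\n"
    );

    long first(long x);
    long second(long x);

    int main(void)
    {
        printf("%ld %ld\n", first(1), second(1));   /* prints 2 3 */
        return 0;
    }

Disassembling with objdump -d shows the padding as long nopw-style instructions rather than a string of 0x90 bytes.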
Closing remarks
Only the local branch prediction information is stored alongside the instructions in the cache; there is a separate global branch-prediction table as well.