To understand why Bulldozer was subpar, I've been looking at Agner Fog's excellent microarchitecture book. On page 178, under Bulldozer, it has this paragraph:
Instructions with up to three prefixes can be decoded in one clock cycle. There is a very large penalty for instructions with more than three prefixes. Instructions with 4-7 prefixes take 14-15 clock cycles extra to decode. Instructions with 8-11 prefixes take 20-22 clock cycles extra, and instructions with 12-14 prefixes take 27 - 28 clock cycles extra. It is therefore not recommended to make NOP instructions longer with more than three prefixes. The prefix count for this rule includes operand size, address size, segment, repeat, lock, REX and XOP prefixes. A three-bytes VEX prefix counts as one, while a two-bytes VEX prefix does not count. Escape codes (0F, 0F38, 0F3A) do not count.
When I searched for prefixes, I was hit with very technical definitions far beyond my abilities, or with suggestions that they are limited to 4 per instruction, which conflicts with the extract above.
So in simple terms, can someone explain what they are/do, and why you might want to tack 14 or more of them onto an instruction instead of breaking it up?
Normally you use as many as needed, with the intended instruction and operands determining that. The assembler issues some of the prefixes automatically, while others you get to use manually.
The case they mention is the multi-byte NOP, which is traditionally used for alignment padding, where the idea is to use a single but appropriately long instruction to conserve resources. Apparently, using more prefixes just to keep it a single instruction may perform worse than using two instructions with fewer prefixes.
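To make the trade-off concrete, here is a rough sketch that applies the penalty figures from the quoted paragraph to 15 bytes of padding. The function names are made up for illustration, the 7-byte and 8-byte NOP encodings are the usual recommended 0F 1F forms, and only 66h prefixes are counted, so treat it as a back-of-the-envelope model rather than a real decoder.

```python
def extra_decode_cycles(prefix_count):
    """(min, max) extra Bulldozer decode cycles, per the quoted paragraph."""
    if prefix_count < 0:
        raise ValueError("prefix count cannot be negative")
    if prefix_count <= 3:
        return (0, 0)
    if prefix_count <= 7:
        return (14, 15)
    if prefix_count <= 11:
        return (20, 22)
    if prefix_count <= 14:
        return (27, 28)
    raise ValueError("the quoted text gives no figure beyond 14 prefixes")


def count_leading_66(encoding):
    """Count 66h bytes at the start of an encoding (hypothetical helper)."""
    n = 0
    for b in encoding:
        if b != 0x66:
            break
        n += 1
    return n


NOP8 = bytes.fromhex("0f1f840000000000")  # 8-byte NOP, no prefixes
NOP7 = bytes.fromhex("0f1f8000000000")    # 7-byte NOP, no prefixes

# Option 1: one 15-byte NOP made by stacking seven 66h prefixes on NOP8.
one_long = bytes([0x66] * 7) + NOP8
# Option 2: two prefix-free NOPs that also add up to 15 bytes.
two_short = [NOP8, NOP7]

single_penalty = extra_decode_cycles(count_leading_66(one_long))
split_penalty = [extra_decode_cycles(count_leading_66(n)) for n in two_short]

print(len(one_long), single_penalty)                 # 15 (14, 15)
print(sum(len(n) for n in two_short), split_penalty)  # 15 [(0, 0), (0, 0)]
```

Both options pad the same 15 bytes, but on Bulldozer only the single long instruction falls into the 4-7 prefix bracket.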
The prefix count for this rule includes operand size, address size, segment, repeat, lock, REX and XOP prefixes. A three-bytes VEX prefix counts as one, while a two-bytes VEX prefix does not count.
Examples:
mov ax, [foo] is encoded the same as mov eax, [foo] but with the prefix 66h (operand size).
mov [eax], foo is encoded the same as mov [rax], foo but with the prefix 67h (address size, in 64-bit mode).
mov [fs:eax], foo is encoded the same as mov [eax], foo but with the prefix 64h (segment override).
rep cmpsb is encoded the same as cmpsb but with the prefix f3h (repeat).
lock add [foo], 1 is encoded the same as add [foo], 1 but with the prefix f0h (lock).
add rax, 1 is encoded the same as add eax, 1 but with the prefix 48h (REX).
add r8d, 1 is encoded the same as add eax, 1 but with the prefix 41h (REX).
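If it helps to see "same encoding plus a prefix" on actual bytes, here is a small sketch that peels the prefixes off an instruction in 64-bit mode. `split_prefixes` is a made-up helper, not part of any library, and it deliberately ignores VEX and XOP; it only knows the legacy prefix bytes and a single REX byte (40h to 4Fh) directly before the opcode.

```python
# Legacy prefix bytes and what they mean.
LEGACY = {
    0xF0: "lock", 0xF2: "repne", 0xF3: "rep",
    0x2E: "cs", 0x36: "ss", 0x3E: "ds", 0x26: "es", 0x64: "fs", 0x65: "gs",
    0x66: "operand-size", 0x67: "address-size",
}


def split_prefixes(code):
    """Return (prefix names, remaining bytes) for one 64-bit mode instruction.

    Simplified on purpose: legacy prefixes first, then at most one REX byte.
    """
    i = 0
    names = []
    while i < len(code) and code[i] in LEGACY:
        names.append(LEGACY[code[i]])
        i += 1
    if i < len(code) and 0x40 <= code[i] <= 0x4F:
        names.append("rex")
        i += 1
    return names, code[i:]


add_eax_1 = bytes.fromhex("83c001")    # add eax, 1
add_rax_1 = bytes.fromhex("4883c001")  # add rax, 1
add_r8d_1 = bytes.fromhex("4183c001")  # add r8d, 1
rep_cmpsb = bytes.fromhex("f3a6")      # rep cmpsb

for code in (add_eax_1, add_rax_1, add_r8d_1, rep_cmpsb):
    names, rest = split_prefixes(code)
    print(code.hex(), names, rest.hex())
```

Once the prefixes are stripped, add rax, 1 and add r8d, 1 leave exactly the bytes of add eax, 1, which is the point of the examples above.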
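The "four prefixes" deal comes from the four legacy "prefix groups": group 1 is lock and the repeat prefixes (F0, F2, F3), group 2 is the segment overrides (2E, 36, 3E, 26, 64, 65), group 3 is the operand-size override (66), and group 4 is the address-size override (67).
You can repeat prefixes, but you should not use several different prefixes from the same group (you can, but the behaviour is undefined). That only matters for groups 1 and 2, since the other groups have only one prefix each.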
Something like 66 66 66 66 66 66 66 66 90 is valid (but potentially slow to decode), while 2E 3E 00 00 (mixing segment overrides) is not.
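The group rule can be sketched as a tiny check over the prefix bytes alone. The group numbering below is the usual one for the legacy prefixes, `mixes_prefixes_within_group` is a hypothetical name, and the function only looks at the prefix bytes you hand it, not at a whole instruction.

```python
# Legacy prefix byte -> prefix group.
PREFIX_GROUP = {
    0xF0: 1, 0xF2: 1, 0xF3: 1,                          # lock / repeat
    0x2E: 2, 0x36: 2, 0x3E: 2, 0x26: 2, 0x64: 2, 0x65: 2,  # segment overrides
    0x66: 3,                                            # operand size
    0x67: 4,                                            # address size
}


def mixes_prefixes_within_group(prefix_bytes):
    """True if two *different* prefixes from the same group appear.

    Repeating one prefix is fine; raises KeyError on a non-prefix byte.
    """
    first_seen = {}
    for b in prefix_bytes:
        group = PREFIX_GROUP[b]
        # setdefault records the first prefix seen for this group.
        if first_seen.setdefault(group, b) != b:
            return True
    return False


repeated_66 = bytes.fromhex("6666666666666666")  # the prefixes of 66 ... 66 90
mixed_segments = bytes.fromhex("2e3e")           # the prefixes of 2E 3E 00 00

print(mixes_prefixes_within_group(repeated_66))     # False
print(mixes_prefixes_within_group(mixed_segments))  # True
```

Note that this only flags the undefined mixing case; it says nothing about how slow the repeated-prefix form is to decode.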
Stacking prefixes can be useful for code alignment when the bytes have to be executed: unlike padding with nop, it doesn't cost execution time. Using too many at once may cost decoding time, though.