If I have a chip that is subject to the Intel JCC erratum, how can I enable the mitigation in gcc (which adjusts branch locations to avoid the problematic alignment), and which gcc versions support it?
By compiler:

- GCC: `-Wa,-mbranches-within-32B-boundaries`
- clang: the `-mbranches-within-32B-boundaries` compiler option directly, not `-Wa`
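For example, a minimal sketch (`foo.c` is a placeholder):

```sh
# GCC: forward the option to GAS with -Wa,
gcc -O2 -Wa,-mbranches-within-32B-boundaries -c foo.c

# clang: its integrated assembler takes the option directly
clang -O2 -mbranches-within-32B-boundaries -c foo.c
```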
See *Intel JCC Erratum - what is the effect of prefixes used for mitigation?*
The GNU toolchain does the mitigation in the assembler, with `as -mbranches-within-32B-boundaries`, which enables the following (GAS manual: x86 options):

- `-malign-branch-boundary=32` (care about 32-byte boundaries). Except the manual says this option takes an exponent, not a direct power of 2, so probably it's actually `...boundary=5`.
- `-malign-branch=jcc+fused+jmp` (the default, which does not include any of `+call+ret+indirect`)
- `-malign-branch-prefix-size=5` (up to 5 segment prefixes per insn)

So the relevant GCC invocation is `gcc -Wa,-mbranches-within-32B-boundaries`.
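Running the assembler by hand, the spelled-out equivalent would be something like this (a sketch; `foo.s` is a placeholder, and per the note above the boundary might need to be given as the exponent `5` on some versions):

```sh
as -malign-branch-boundary=32 \
   -malign-branch=jcc+fused+jmp \
   -malign-branch-prefix-size=5 \
   -o foo.o foo.s
```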
Unfortunately, GCC `-mtune=skylake` doesn't enable this.
GAS's strategy seems to be to pad as early as possible after the last alignment directive (e.g. `.p2align`) or after the last jcc/jmp that can end before a 32B boundary. I guess that might end up with padding in outer loops, before or after inner loops, maybe helping them fit in fewer uop cache lines? (Skylake also has its LSD loop buffer disabled, so a tiny loop split across two uop cache lines can run at best 2 cycles per iteration, instead of 1.)
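One way to see where the padding lands (a sketch; `test.s` stands for any small file with branches near 32-byte boundaries):

```sh
# Assemble with and without the mitigation, then diff the disassembly
as test.s -o plain.o
as -mbranches-within-32B-boundaries test.s -o padded.o
objdump -d plain.o  > plain.txt
objdump -d padded.o > padded.txt
diff plain.txt padded.txt    # look for inserted segment prefixes / NOPs
```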
It can lead to quite a large amount of padding with long macro-fused jumps, such as with `-fstack-protector-strong`, which in recent GCC uses `sub rdx,QWORD PTR fs:0x28` / `jnz` (earlier GCC used to use `xor`, which can't fuse even on Intel). That's 11 bytes total of sub + jnz, so it could require 11 bytes of CS prefixes in the worst case to shift it to the start of a new 32B block. Example showing 8 CS prefixes on the insns before it: https://godbolt.org/z/n1dYGMdro
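For reference, a sketch of that fused pair in isolation (Intel syntax; the labels are invented, and `ud2` stands in for the real `__stack_chk_fail` call):

```asm
    .intel_syntax noprefix
check_canary:
    sub   rdx, QWORD PTR fs:0x28   # 9 bytes: compare against the TLS stack canary
    jnz   .L_fail                  # 2-byte short jnz; macro-fuses with the sub
    ret                            # the fused 11-byte pair must not touch a 32B boundary
.L_fail:
    ud2                            # stand-in for the real call to __stack_chk_fail
```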
GCC doesn't know instruction sizes; it only prints text. That's why it needs GAS to support stuff like `.p2align 4,,10` to align by 16 if that will take at most 10 bytes of padding, to implement the alignment heuristics it wants to use. (Often followed by `.p2align 3` to unconditionally align by 8.)
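A sketch of that idiom in front of a loop (labels and the loop body are invented):

```asm
    .p2align 4,,10      # align to 16 only if that needs at most 10 padding bytes
    .p2align 3          # then align to 8 unconditionally
.L_top:
    addq  %rcx, %rax    # dummy loop body
    decq  %rdi
    jnz   .L_top
```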
`as` has other fun options that aren't on by default, like `-Os` to optimize hand-written asm, like `mov $1, %rax` => `mov $1, %eax` / `xor %rax,%rax` => `%eax` / `test $1, %eax` => `al`, and even EVEX => VEX for stuff like `vmovdqa64` => `vmovdqa`.
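A file to see those in action (a sketch; assemble once with plain `as` and once with `as -Os`, then compare the `objdump -d` output):

```asm
    mov   $1, %rax             # 7 bytes; -Os can emit the 5-byte mov $1, %eax
    xor   %rax, %rax           # 3 bytes; -Os drops REX.W: xor %eax, %eax (2 bytes)
    test  $1, %eax             # 5 bytes; -Os can use test $1, %al (2 bytes)
    vmovdqa64 %xmm1, %xmm0     # EVEX, 6 bytes; -Os relaxes to VEX vmovdqa (4 bytes)
```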
Also stuff like `-msse2avx` to always use VEX prefixes even when the mnemonic isn't `v...`, and `-momit-lock-prefix=yes`, which could be used to build `std::atomic` code for a uniprocessor system.
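For example (a sketch; the file name is a placeholder, and this is only safe for code that will never run on an SMP system):

```sh
# Drop lock prefixes from compiler-generated atomic RMW instructions
g++ -O2 -Wa,-momit-lock-prefix=yes -c uniprocessor_only.cpp
```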
And `-mfence-as-lock-add=yes` to assemble `mfence` into `lock addl $0x0, (%rsp)`. But insanely it also does that for `sfence` and even `lfence`, so it's unusable in code that uses `lfence` as an execution barrier, which is the primary use-case for `lfence`, e.g. for retpolines or timing like `lfence;rdtsc`.
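A sketch of the hazard: with `-mfence-as-lock-add=yes`, the `lfence` below is rewritten too, defeating its purpose:

```asm
    lfence          # execution barrier: later insns can't start until earlier ones complete
    rdtsc           # intended to read the TSC only after earlier work finishes
    # Under -mfence-as-lock-add=yes the lfence above is assembled as
    # lock addl $0x0,(%rsp): a StoreLoad memory barrier, not an execution
    # barrier, so rdtsc could execute early and the timing would be wrong.
```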
`as` also has CPU feature-level checking with `-march=znver3` for example, or `.arch` directives. And `-mtune=CPU`, although IDK what that does. Perhaps set the NOP strategy?
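A sketch of the `.arch` checking (I'm assuming current binutils; the exact directive spellings and error text may vary):

```asm
    .arch generic64
    .arch .sse4.2                  # enable up through SSE4.2 only
    pcmpgtq %xmm1, %xmm0           # OK: SSE4.2 instruction
    vaddps  %ymm1, %ymm1, %ymm0    # error: AVX isn't enabled at this .arch level
```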