I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc., using x86-64 assembly with SSE-2 instructions.
So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise more.
For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps, completely degrades the overall performance, and there's absolutely no reason for it in terms of the code itself.
So my guess is that there are issues with code alignment and/or with branches that get mispredicted.
I know that, even with the same architecture (x86-64), different CPUs have different algorithms for branch prediction.
But is there any general advice about code alignment and branch prediction when developing for high performance on x86-64?
In particular, about alignment, should I ensure all labels used by jump instructions are aligned on a DWORD?
_func:
; ... Some code ...
test rax, rax
jz .label
; ... Some code ...
ret
.label:
; ... Some code ...
ret
In the previous code, should I use an align directive before .label, like:
align 4
.label:
If so, is it enough to align on a DWORD when using SSE-2?
And about branch prediction, is there a «preferred» way to organise the labels used by jump instructions, in order to help the CPU, or are today's CPUs smart enough to determine that at runtime by counting the number of times a branch is taken?
EDIT
Ok, here's a concrete example - here's the start of strlen() with SSE-2:
_strlen64_sse2:
mov rsi, rdi
and rdi, -16
pxor xmm0, xmm0
pcmpeqb xmm0, [ rdi ]
pmovmskb rdx, xmm0
; ...
Running it 10'000'000 times with a 1000 character string gives about 0.48 seconds, which is fine.
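For context, the elided part of such a routine typically scans 16 bytes at a time and locates the zero byte with bsf. The following is only a rough sketch of that general technique, with invented label names, not the actual code being benchmarked:
_strlen64_sse2_sketch:
    mov      rsi, rdi            ; keep the original string pointer
    and      rdi, -16            ; round the pointer down to a 16-byte boundary
    pxor     xmm0, xmm0          ; xmm0 = sixteen zero bytes
    pcmpeqb  xmm0, [rdi]         ; compare the first aligned block against zero
    pmovmskb edx, xmm0           ; one mask bit per byte that was zero
    mov      ecx, esi
    and      ecx, 15             ; misalignment of the real string start
    shr      edx, cl             ; discard bytes located before the string
    bsf      eax, edx            ; index of the first zero byte, ZF=1 if none
    jnz      .done               ; found in the first block: eax is the length
.loop:
    add      rdi, 16
    pxor     xmm0, xmm0
    pcmpeqb  xmm0, [rdi]         ; scan the next 16-byte block
    pmovmskb edx, xmm0
    test     edx, edx
    jz       .loop               ; no zero byte yet, keep scanning
    bsf      eax, edx            ; offset of the zero byte inside the block
    add      rax, rdi
    sub      rax, rsi            ; length = address of the NUL minus the start
.done:
    ret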
But it does not check for a NULL string input. So obviously, I'll add a simple check:
_strlen64_sse2:
test rdi, rdi
jz .null
; ...
With the same test, it now runs in 0.59 seconds. But if I align the code after this check:
_strlen64_sse2:
test rdi, rdi
jz .null
align 8
; ...
The original performance is back. I used 8 for the alignment, as 4 doesn't change anything.
Can anyone explain this, and give some advice about when to align code sections, and when not to?
EDIT 2
Of course, it's not as simple as aligning every branch target. If I do that, performance usually gets worse, except in specific cases like the one above.
The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high performance in many modern pipelined microprocessor architectures such as x86.
Branch prediction buffers contain a prediction about whether the next branch will be taken (T) or not (NT), but they do not supply the target PC value. A Branch Target Buffer (BTB) does this: it is a cache that holds (instruction address, predicted PC) pairs.
Branch prediction lets the CPU guess the outcome of a conditional branch and speculatively fetch and execute instructions along the predicted path instead of waiting for the branch to resolve. When the guess is correct, no cycles are wasted, so the processor works much more efficiently.
A branch instruction is not inherently slower than any other instruction. However, the reason you heard that branches should be avoided is that modern CPUs follow a pipeline architecture: there are multiple instructions in flight at the same time, and a mispredicted branch forces the pipeline to discard the work done on the wrong path.
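As a purely hypothetical illustration of that cost (not taken from the question's code), consider a branch whose outcome depends on effectively random data:
    test    al, 1               ; al holds an unpredictable, data-dependent bit
    jz      .skip               ; mispredicts roughly half the time on random input
    add     rbx, rcx            ; fetched and executed speculatively
.skip:
Every wrong guess forces the pipeline to throw away the speculatively fetched instructions and restart from the correct path, which typically costs on the order of 15-20 cycles on recent x86-64 cores.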
Use .p2align <abs-expr> <abs-expr> <abs-expr> instead of align. It grants fine-grained control using its 3 parameters:
- param1 - Align to what boundary (as a power of two).
- param2 - Fill the padding with what (zeroes or NOPs).
- param3 - Do NOT align if the padding would exceed the specified number of bytes.
Use multi-byte NOPs for padding to reduce the time spent executing NOPs (a short usage sketch follows the NOP table below).
/* nop */
static const char nop_1[] = { 0x90 };
/* xchg %ax,%ax */
static const char nop_2[] = { 0x66, 0x90 };
/* nopl (%[re]ax) */
static const char nop_3[] = { 0x0f, 0x1f, 0x00 };
/* nopl 0(%[re]ax) */
static const char nop_4[] = { 0x0f, 0x1f, 0x40, 0x00 };
/* nopl 0(%[re]ax,%[re]ax,1) */
static const char nop_5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
/* nopw 0(%[re]ax,%[re]ax,1) */
static const char nop_6[] = { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 };
/* nopl 0L(%[re]ax) */
static const char nop_7[] = { 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00 };
/* nopl 0L(%[re]ax,%[re]ax,1) */
static const char nop_8[] =
{ 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00};
/* nopw 0L(%[re]ax,%[re]ax,1) */
static const char nop_9[] =
{ 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
/* nopw %cs:0L(%[re]ax,%[re]ax,1) */
static const char nop_10[] =
{ 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
(Up to 10-byte NOPs for x86. Source: binutils-2.2.3.)
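As a hedged usage sketch of those three parameters (GAS syntax, since .p2align is a GNU as directive; the label is hypothetical), this is the kind of pattern GCC commonly emits before hot loop labels:
        .p2align 4,,10          # align to 16 bytes, but only if it costs <= 10 padding bytes
        .p2align 3              # otherwise still guarantee 8-byte alignment
.Lhot_loop:
        addq    %rcx, %rax      # loop body
        decq    %rdi
        jnz     .Lhot_loop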
There are lots of variations between x86-64 micro-architectures/generations. However, a common set of guidelines that apply to all of them can be summarised as follows. Reference: Section 3 of Agner Fog's x86 micro-architecture manual.
Loop detection logic is guaranteed to work ONLY for loops with < 64 iterations. This is due to the fact that a branch instruction is recognized as having loop behavior if it goes one way n-1 times and then goes the other way 1 time, for any n up to 64.
This doesn't really apply to the predictors in Haswell and later, which use a TAGE predictor and don't have dedicated loop-detection logic for specific branches. Iteration counts of ~23 can be the worst case for an inner loop inside a tight outer loop with no other branching, on Skylake: the exit from the inner loop mispredicts most times, but the trip count is so low that this happens often. Unrolling can help by shortening the pattern, but for very high loop trip counts the single mispredict at the end is amortized over a lot of trips, and it would take an unreasonable amount of unrolling to do anything about it.
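As a hedged sketch of the unrolling idea mentioned above (hypothetical code, unrelated to the question's strlen), replicating the body four times makes the backward branch taken a quarter as often, which both shortens the exit pattern and reduces per-iteration branch overhead:
.sum_unrolled:                  ; rcx is assumed to be a multiple of 4 qwords here
    add     rax, [rdi]
    add     rax, [rdi + 8]
    add     rax, [rdi + 16]
    add     rax, [rdi + 24]
    add     rdi, 32
    sub     rcx, 4
    jnz     .sum_unrolled       ; a real routine would also need a tail loop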
Far jumps are not predicted, i.e. the pipeline always stalls on a far jump to a new code segment (CS:RIP). There's basically never a reason to use a far jump anyway, so this is mostly not relevant.
Indirect jumps with an arbitrary 64-bit absolute address are predicted normally on most CPUs.
But Silvermont (Intel's low-power CPUs) has some limitations in predicting indirect jumps when the target is more than 4GB away, so avoiding that by loading/mapping executables and shared libraries into the low 32 bits of virtual address space can be a win there, e.g. on GNU/Linux by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. See Intel's optimization manual for more details.
To expand on TheCodeArtist's answer, which makes some good points, here are a few additional details, as I was eventually able to solve the problem.
1 - Code alignment
Intel recommends aligning code and branch targets on 16-byte boundaries:
3.4.1.5 - Assembly/Compiler Coding Rule 12. (M impact, H generality)
All branch targets should be 16-byte aligned.
While this is usually good advice, it should be applied carefully.
Blindly 16-byte aligning everything may lead to performance loss, so this should be tested on each branch target before being applied.
As TheCodeArtist pointed out, using multi-byte NOPs may help here, as simply using standard one-byte NOPs may not bring the expected performance gain of code alignment.
As a sidenote, the .p2align directive is not available in NASM or YASM, but they do support padding with instructions other than NOPs, via the standard align directive:
align 16, xor rax, rax
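For instance, here is a hedged NASM sketch of aligning a function entry and a hot branch target (the function is hypothetical; plain align pads with one-byte NOPs by default, while the smartalign package lets NASM emit multi-byte NOPs instead):
    %use smartalign
    alignmode p6                ; prefer long NOPs for code padding

    align 16
_hypothetical_sum:              ; rdi = value to add, rsi = iteration count
    xor     rax, rax
    test    rsi, rsi
    jz      .done
    align   16                  ; align the loop's branch target
.loop:
    add     rax, rdi
    dec     rsi
    jnz     .loop
.done:
    ret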
2 - Branch prediction
This turned out to be the most important part.
While it's true that every generation of x86-64 CPUs has different branch prediction algorithms, some simple rules can be applied generally to help the CPU predict which branch will likely be taken.
The CPU tries to keep a branching history in the BTB (Branch Target Buffer).
But when branch information is not available in the BTB, the CPU falls back to what is called static prediction, which obeys simple rules, as mentioned in Intel's manuals:
- A forward conditional branch is predicted not taken.
- A backward conditional branch is predicted taken.
Here's an example for the first case:
test rax, rax
jz .label
; Fallthrough - Most likely
.label:
; Forward branch - Most unlikely
The instructions under .label are the unlikely path, because .label is placed after the actual branch (a forward branch).
For the second case:
.label:
; Backward branch - Most likely
test rax, rax
jz .label
; Fallthrough - Most unlikely
Here, the instructions under .label are the likely path, as .label is placed before the actual branch (a backward branch).
So each conditional branch should always follow this simple pattern.
And of course, this is also suitable for loops.
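For example, a counted loop written this way matches the second rule naturally, since the backward conditional jump is taken on every iteration except the last (hypothetical snippet):
    xor     eax, eax
.next_block:
    add     rax, [rdi]
    add     rdi, 8
    dec     rcx
    jnz     .next_block         ; backward branch: taken on all but the final iteration
    ; execution falls through here exactly once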
As I mentioned before, this was the most important part.
I was experiencing unpredictable performance gains or losses while adding simple tests that should logically have improved the overall performance.
Sticking strictly to these rules solved the issue.
Otherwise, adding a branch for optimisation purposes may have the opposite result.
TheCodeArtist also mentions loop unrolling in his answer.
While this wasn't the issue, as my loops were already unrolled, I mention it here as it's indeed extremely important, and brings substantial performance gains.
And as a last note for the readers, while this may seem obvious and wasn't the problem here, don't branch when unnecessary.
Starting with the Pentium Pro, x86 processors have conditional move instructions, which may help eliminate branching and remove the risk of misprediction:
test rax, rax
cmovz rbx, rcx
So it's a nice thing to keep in mind, just in case.
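As a small, hypothetical usage sketch, here is a branchless max of two signed 64-bit integers built on the same idea:
_max_s64:
    mov     rax, rdi
    cmp     rax, rsi
    cmovl   rax, rsi            ; take rsi if rdi < rsi, with no branch to mispredict
    ret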
To get a better understanding of why and how alignment matters, check out Agner Fog's microarchitecture doc, especially the section about the instruction-fetch front-end of the various CPU designs. Sandybridge introduced the uop cache, which makes a huge difference to throughput, especially in SSE code, where the instruction length is often too long for a 16B fetch per cycle to cover 4 instructions.
The rules for filling uop cache lines are complicated, but a new block of 32B of instructions always starts a new cache line, IIRC. So aligning hot function entry points to 32B is a good idea. That much padding in other cases might be hurting I$ density more than helping. (L1 I$ still has 64B cache lines, though, so some things might hurt L1 I$ density while helping uop cache density.)
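In NASM terms, that just means padding hot entry points to a 32-byte boundary (hypothetical function name; the padding sits before the entry, so it is never executed):
    align 32
_hot_entry:
    pxor     xmm0, xmm0
    pcmpeqb  xmm0, [rdi]
    pmovmskb eax, xmm0
    ret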
The loop buffer helps too, but taken branches disrupt the 4-uops-per-cycle issue, especially before Haswell. E.g. a loop of 3 uops executes like abc, abc, not abca, bcab, ... on SnB/IvB. So a 5-uop loop runs at one iteration per 2 cycles, not one per 1.25. This makes unrolling even more valuable. (Haswell and later seem to unroll tiny loops in the LSD, making a 5-uop loop a lot less bad: Is performance reduced when executing loops whose uop count is not a multiple of processor width?)