gcc 5.3 with -O3 -mavx -mtune=haswell
for x86-64 makes surprisingly bulky code to handle potentially-misaligned inputs for code like:
// convenient simple example of compiler input
// I'm not actually interested in this for any real program
void floatmul(float *a) {
for (int i=0; i<1024 ; i++)
a[i] *= 2;
}
clang uses unaligned load/store instructions, but gcc does a scalar intro/outro and an aligned vector loop: It peels off the first up-to-7 unaligned iterations, fully unrolling that into a sequence of
vmovss xmm0, DWORD PTR [rdi]
vaddss xmm0, xmm0, xmm0 ; multiply by two
vmovss DWORD PTR [rdi], xmm0
cmp eax, 1
je .L13
vmovss xmm0, DWORD PTR [rdi+4]
vaddss xmm0, xmm0, xmm0
vmovss DWORD PTR [rdi+4], xmm0
cmp eax, 2
je .L14
...
This seems pretty terrible, esp. for CPUs with a uop cache. I reported a gcc bug about this, with a suggestion for smaller/better code that gcc could use when peeling unaligned iterations. It's probably still not optimal, though.
This question is about what actually would be optimal with AVX. I'm asking about general-case solutions that gcc and other compilers could/should use. (I didn't find any gcc mailing list hits with discussion about this, but didn't spend long looking.)
There will probably be multiple answers, since what's optimal for -mtune=haswell
will probably be different from what's optimal for -mtune=bdver3
(steamroller). And then there's the question of what's optimal when allowing instruction set extensions (e.g. AVX2 for 256b integer stuff, BMI1 for turning a count into a bitmask in fewer instructions).
I'm aware of Agner Fog's Optimizing Assembly guide, Section 13.5 Accessing unaligned data and partial vectors. He suggests either using unaligned accesses, doing an overlapping write at the start and/or end, or shuffling data from aligned accesses (but PALIGNR
only takes an imm8 count, so 2x pshufb
/ por
). He discounts VMASKMOVPS
as not useful, probably because of how badly it performs on AMD. I suspect that if tuning for Intel, it's worth considering. It's not obvious how to generate the correct mask, hence the question title.
It might turn out that it's better to simply use unaligned accesses, like clang does. For short buffers, the overhead of aligning might kill any benefit from avoiding cacheline splits for the main loop. For big buffers, main memory or L3 as the bottleneck may hide the penalty for cacheline splits. If anyone has experimental data to back this up for any real code they've tuned, that's useful information too.
VMASKMOVPS
does look usable for Intel targets. (The SSE version is horrible, with an implicit non-temporal hint, but the AVX version doesn't have that. There's even a new intrinsic to make sure you don't get the SSE version for 128b operands: _mm128_maskstore_ps
) The AVX version is only a little bit slow on Haswell:
The store form is still unusably slow on AMD CPUs, both Jaguar (1 per 22c tput) and Bulldozer-family: 1 per 16c on Steamroller (similar on Bulldozer), or 1 per ~180c throughput on Piledriver.
But if we do want to use VMASKMOVPS
, we need a vector with the high bit set in each element that should actually be loaded/stored. PALIGNR and PSRLDQ (for use on a vector of all-ones) only take compile-time-constant counts.
Notice that the other bits don't matter: it doesn't have to be all-ones, so scattering some set bits out to the high bits of the elements is a possibility.
Thanks to @StephenCanon for pointing out that this is better than VMASKMOVPS
for anything that VMASKMOVPS
could do to help with looping over unaligned buffers.
This is maybe a bit much to expect a compiler to do as a loop transformation, esp. since the obvious way can make Valgrind unhappy (see below).
section .text
global floatmul ; (float *rdi)
floatmul:
lea rdx, [rdi + 4096 - 32] ; one full vector less than the end address (calculated *before* masking for alignment).
;; replace 4096 with rsi*4 if rsi has the count (in floats, not bytes)
vmovups ymm0, [rdi] ; first vector
vaddps ymm0, ymm0, ymm0 ; *= 2.0
; don't store yet
lea rax, [rdi+32]
and rax, ~0x1c ; 0x1c = 7 << 2 = 0b11100 ; clear those bits.
vmovups ymm1, [rax] ; first aligned vector, for use by first loop iteration
vmovups [rdi], ymm0 ; store the first unaligned vector
vmovups ymm0, [rdx] ; load the *last* unaligned vector
.loop:
;; on entry: [rax] is already loaded into ymm1
vaddps ymm1, ymm1, ymm1 ; *= 2.0
vmovups [rax] ; vmovaps would fault if p%4 != 0
add rax, 32
vmovups ymm1, [rax]
cmp rax, rdx ; while( (p+=8) < (endp-8) );
jb .loop
; discard ymm1. It includes data from beyond the end of the array (aligned case: same as ymm0)
vaddps ymm0, ymm0, ymm0 ; the last 32B, which we loaded before the loop
vmovups [rdx], ymm0
ret
; End alignment:
; a[] = XXXX XXXX ABCD E___ _ = garbage past the end
; ^rdx
; ^rax ^rax ^rax ^rax(loop exit)
; ymm0 = BCDE
; ymm1 loops over ..., XXXX, ABCD, E___
; The last load off the end of the array includes garbage
; because we pipeline the load for the next iteration
Doing a load from the end of the array at the start of the loop seems a little weird, but hopefully it doesn't confuse the hardware prefetchers, or slow down getting the beginning of the array streaming from memory.
2 extra integer uops total (to set up the aligned-start). We're already using the end pointer for the normal loop structure, so that's free.
2 extra copies of the loop body (load/calc/store). (First and last iteration peeled).
Compilers probably won't be happy about emitting code like this, when auto-vectorizing. Valgrind will report accesses outside of array bounds, and does so by single-stepping and decoding instructions to see what they're accessing. So merely staying within the same page (and cache line) as the last element in the array isn't sufficient. Also note that if the input pointer isn't 4B-aligned, we can potentially read into another page and segfault.
To keep Valgrind happy, we could stop the loop two vector widths early, to do the special-case load of the unaligned last vector-width of the array. That would require duplicating the loop body an extra time (insignificant in this example, but it's trivial on purpose.) Or maybe avoid pipelining by having the intro code jump into the middle of the loop. (That may be sub-optimal for the uop-cache, though: (parts of) the loop body may end up in the uop cache twice.)
TODO: write a version that jumps into the loop mid-way.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With