Let's say I have a 16-byte aligned structure that just wraps a [f32; 3] array:
#[repr(C, align(16))]
pub struct Vector(pub [f32; 3]);
Now I want to divide two instances of it, like this:
use core::arch::x86_64;
let a = Vector([1f32, 2f32, 3f32]);
let b = Vector([4f32, 5f32, 6f32]);
let mut q = Vector([0f32, 0f32, 0f32]);
unsafe {
    let a1 = x86_64::_mm_load_ps(a.0.as_ptr());
    let b1 = x86_64::_mm_load_ps(b.0.as_ptr());
    let q1 = x86_64::_mm_div_ps(a1, b1);
    x86_64::_mm_store_ps(q.0.as_mut_ptr(), q1);
}
It does the division, but there's a problem: the 4th element contains garbage that, among other things, can be a signalling NaN. And if certain exception flags are unmasked, SIGFPE will be triggered. I want to avoid this without silencing the signal completely, i.e. I either want to silence it only for the 4th pair of elements, or put some sane values there. What is the best, fastest way to do this? Or maybe there's a better approach in general?
Generally nobody unmasks FP exceptions; otherwise you'd need shuffles to, e.g., duplicate one of the elements so the top element does the same division as one of the other elements, or some other known-safe operation.
Maybe you can get away with only shuffling the divisor, if you can assume the dividend is non-NaN in that element, as sketched below.
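A minimal sketch of the duplicate-a-lane idea (the div3 helper name is mine, not from the question; on recent toolchains the shuffle immediate is a const generic). Here both operands are shuffled so lane 3 repeats lane 2's division, which can never raise an exception that lane 2 wouldn't raise anyway; if you know the dividend's padding lane is a finite non-NaN value, you could skip shuffling it:
use core::arch::x86_64::*;

// Shuffle control 0b10_10_01_00 selects lanes (2, 2, 1, 0), high to low,
// so lane 3 becomes a copy of lane 2. The padding garbage is still loaded,
// but it never reaches the divider.
unsafe fn div3(a: *const f32, b: *const f32, out: *mut f32) {
    let a1 = _mm_load_ps(a);
    let b1 = _mm_load_ps(b);
    let a2 = _mm_shuffle_ps::<0b10_10_01_00>(a1, a1); // lane 3 := a[2]
    let b2 = _mm_shuffle_ps::<0b10_10_01_00>(b1, b1); // lane 3 := b[2]
    let q = _mm_div_ps(a2, b2);
    _mm_store_ps(out, q); // lane 3 of the result is a[2]/b[2]; ignore it
}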
With AVX512 you could suppress exceptions for an element using zero-masking, but until then there's no such feature. Also AVX512 lets you override the rounding mode + Suppress All Exceptions (SAE) without masking, so you could make nearest-even explicit to get SAE. But that suppresses exceptions for all elements.
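If AVX-512VL is available, a hedged sketch of the zero-masking idea (assuming a toolchain where the AVX-512 intrinsics have been stabilized; _mm_maskz_div_ps with mask 0b0111 divides only the low three lanes and zeroes lane 3, so no FP operation happens on the padding at all):
use core::arch::x86_64::*;

// Lanes 0..2 are divided; lane 3 is zero-masked, so it cannot raise an
// exception regardless of what the padding holds.
#[target_feature(enable = "avx512f,avx512vl")]
unsafe fn div3_maskz(a: *const f32, b: *const f32, out: *mut f32) {
    let a1 = _mm_load_ps(a);
    let b1 = _mm_load_ps(b);
    let q = _mm_maskz_div_ps(0b0111, a1, b1); // lane 3 -> 0.0, no division there
    _mm_store_ps(out, q);
}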
Seriously, don't enable FP exceptions. Compilers barely / don't know how to optimize in a way that's safe if the number of exceptions is a visible side-effect. e.g. GCC's -ftrapping-math is on by default, but it's broken.
I wouldn't assume LLVM is any better; the default strict FP probably still does optimizations that could give one SIGFPE where the source would have raised 2 or 4. Maybe even optimizations that raise 0 where the source would raise 1, or vice versa, like GCC's broken and near-useless default.
Enabling FP exceptions might be useful for debugging, though, if you expect to never have any of a certain kind of exception. But you can probably deal with the occasional false positive from a SIMD instruction by ignoring ones with that source address.
If there's a tradeoff between performance and exception-correctness, most users of a library would rather that it maximized performance.
Even clearing and then checking the sticky FP masked-flags with fenv stuff is rarely done, and requires controlled circumstances to make use of. I wouldn't have any expectations for a library function call, especially not one that used any SIMD.
You can get slowdowns from subnormals (aka denormals) if MXCSR doesn't have FTZ and DAZ set (i.e. the normal case, unless you compiled with (the Rust equivalent of) -ffast-math).
Producing a NaN or +-Inf takes no extra time for typical x86 hardware with SSE / AVX instructions. (Fun fact: NaN is slow with legacy x87 math, even on modern HW.) So it's safe to _mm_or_ps with a cmpps result to create a NaN in some elements of a vector before a math operation, for example. Or _mm_and_ps to create some zeros in the divisor before division, as sketched below.
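A minimal sketch of that AND-masking idea (the div3_and_mask name is mine; this assumes FP exceptions stay masked, the default, so the x/0 = Inf produced in the padding lane is harmless and costs nothing):
use core::arch::x86_64::*;

// Force the divisor's padding lane to 0.0 before dividing. With exceptions
// masked, lane 3 just produces +-Inf or NaN, which is free on SSE/AVX hardware.
unsafe fn div3_and_mask(a: *const f32, b: *const f32, out: *mut f32) {
    // Keep lanes 0..2 (all-ones bits), clear lane 3 (all-zero bits).
    let keep3 = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    let a1 = _mm_load_ps(a);
    let b1 = _mm_and_ps(_mm_load_ps(b), keep3); // divisor lane 3 := 0.0
    _mm_store_ps(out, _mm_div_ps(a1, b1));      // lane 3 is garbage/0; ignore it
}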
But be careful about what garbage is in your padding because it could lead to spurious subnormals. 0.0 and NaN (all ones) are generally always safe.
Using only 3 out of 4 elements of a SIMD vector is usually a bad idea, because it usually means you're using a single SIMD vector to hold a single geometry vector, instead of three vectors of 4 x coords, 4 y coords, and 4 z coords.
Shuffles / horizontal stuff mostly costs extra instructions (except for broadcast loads of a scalar that was already in memory), but you often need a lot of shuffles if you're using SIMD this way. There are cases where you can't vectorize over an array of things, but you can still get a speedup with SIMD.
If you're just using this partial-vector stuff for the leftover elements of an odd-sized operation then great, one partial vector is much better than 3 scalar iterations. But most people asking about using only 3 of 4 vector elements are asking because they're using SIMD wrong: e.g. adding geometry vectors as SIMD vectors is still cheap, but a dot-product needs shuffles. See https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ for some nice stuff about how to use SIMD the right way (SoA vs. AoS and so on). If you already know about that and are just using 3-element vectors for the odd corner case, not for most of the work, then that's fine.
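For illustration, a minimal SoA sketch (the Vectors4 type and div4 helper are hypothetical, not from the question) that divides four geometry vectors at once with full-width SIMD and no padding lane at all:
use core::arch::x86_64::*;

// Hypothetical SoA layout: four geometry vectors stored as x/y/z planes.
#[repr(C, align(16))]
pub struct Vectors4 {
    pub x: [f32; 4],
    pub y: [f32; 4],
    pub z: [f32; 4],
}

// Divide four vectors component-wise with three full-width divisions;
// every lane does useful work and there is no padding to worry about.
unsafe fn div4(a: &Vectors4, b: &Vectors4, out: &mut Vectors4) {
    _mm_store_ps(out.x.as_mut_ptr(), _mm_div_ps(_mm_load_ps(a.x.as_ptr()), _mm_load_ps(b.x.as_ptr())));
    _mm_store_ps(out.y.as_mut_ptr(), _mm_div_ps(_mm_load_ps(a.y.as_ptr()), _mm_load_ps(b.y.as_ptr())));
    _mm_store_ps(out.z.as_mut_ptr(), _mm_div_ps(_mm_load_ps(a.z.as_ptr()), _mm_load_ps(b.z.as_ptr())));
}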
Padding to a multiple of the vector width is generally great for odd sizes, but another option for some algos is a final unaligned vector that ends at the end of your data. A partially-overlapping store is fine, unless it's an in-place algorithm and you have to worry about not doing an element twice. (Or about store-forwarding stalls even for idempotent operations like AND-masking or clamping).
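A hedged sketch of that final-overlapping-vector trick for an idempotent in-place operation (the clamp_nonneg helper is mine; the last vector starts at len - 4 so the tail is covered without scalar cleanup, assuming at least 4 elements):
use core::arch::x86_64::*;

// Clamp every element to be >= 0.0. Clamping is idempotent, so processing
// an element twice in the overlapping final vector is harmless.
unsafe fn clamp_nonneg(data: &mut [f32]) {
    assert!(data.len() >= 4);
    let zero = _mm_setzero_ps();
    let mut i = 0;
    while i + 4 <= data.len() {
        let p = data.as_mut_ptr().add(i);
        _mm_storeu_ps(p, _mm_max_ps(_mm_loadu_ps(p), zero));
        i += 4;
    }
    if i < data.len() {
        // Final unaligned vector that ends exactly at the end of the data;
        // it may overlap elements already clamped, which is fine.
        let p = data.as_mut_ptr().add(data.len() - 4);
        _mm_storeu_ps(p, _mm_max_ps(_mm_loadu_ps(p), zero));
    }
}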
If you had just 2 float elements left over, a movsd load will load + zero-extend into an XMM register. You might as well get the compiler to do that instead of a movaps.
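If you want to spell that out with intrinsics rather than letting the compiler pick movsd, one way looks like this (a sketch; _mm_load_sd is the 64-bit scalar load that zero-extends, and the cast is a free bit reinterpretation):
use core::arch::x86_64::*;

// Load two leftover f32s with a movsd-style 64-bit load; the upper
// two lanes of the XMM register become 0.0 for free.
unsafe fn load2(p: *const f32) -> __m128 {
    _mm_castpd_ps(_mm_load_sd(p as *const f64))
}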
Otherwise, if shuffling together 3 scalars, insertps can zero elements. Or you might have known-zero high parts of xmm regs from movss loads from memory. So using a 0.0 as part of a vector-from-scalar initializer (like C++'s _mm_set_ps()) can be free for the compiler.
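In Rust that initializer looks like this (a small sketch; note that _mm_set_ps takes its arguments highest-lane first, so the 0.0 lands in the padding lane):
use core::arch::x86_64::*;

// Build a vector from three scalars with a known-zero top lane.
// _mm_set_ps(e3, e2, e1, e0) takes the highest lane first.
unsafe fn from_xyz(x: f32, y: f32, z: f32) -> __m128 {
    _mm_set_ps(0.0, z, y, x)
}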
With AVX, you can consider using a masked load if you're worried about padding causing a subnormal: https://www.felixcloutier.com/x86/vmaskmov. But that's somewhat slower than vmovaps. And masked stores are much more expensive on AMD, even Ryzen.
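A hedged sketch of that masked load (requires the avx target feature; the sign bit of each mask element selects whether that lane is loaded, and masked-out lanes come back as 0.0):
use core::arch::x86_64::*;

// Load only the three real elements; the padding lane is never read from
// memory and is returned as 0.0.
#[target_feature(enable = "avx")]
unsafe fn load3_masked(p: *const f32) -> __m128 {
    let mask = _mm_set_epi32(0, -1, -1, -1); // sign bit set => load that lane
    _mm_maskload_ps(p, mask)
}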
In Rust, as in C, sizeof is always a multiple of alignof: a necessity, since sizeof is used as the stride in arrays and array elements need to be aligned properly. As a result, even though you only use 12 bytes for your struct, its sizeof is 16 bytes anyway, with 4 bytes of "padding".
I would therefore propose a very pragmatic fix: own the padding. Instead of making the internals of the struct visible, give it a constructor and an accessor... and pad it to 16 bytes with the value 1.0.
#[repr(C, align(16))]
pub struct Vector([f32; 4]);

impl Vector {
    pub fn new(v: [f32; 3]) -> Vector {
        Vector([v[0], v[1], v[2], 1.0])
    }

    pub fn pad(&mut self, pad: f32) { self.0[3] = pad; }

    pub fn as_ptr(&self) -> *const f32 { self.0.as_ptr() }
}
Then you can perform operations with confidence that no garbage byte is used.
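For example, the original division can then go through the accessors, and the padding lane just computes 1.0 / 1.0 (a sketch; it assumes an as_mut_ptr() accessor analogous to as_ptr(), which is not shown above):
use core::arch::x86_64::*;

// With the 1.0 padding, lane 3 computes 1.0 / 1.0, which can never raise
// an exception or produce a subnormal.
unsafe fn div(a: &Vector, b: &Vector, out: &mut Vector) {
    let q = _mm_div_ps(_mm_load_ps(a.as_ptr()), _mm_load_ps(b.as_ptr()));
    _mm_store_ps(out.as_mut_ptr(), q); // assumes an as_mut_ptr() accessor
}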