Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the instruction that gives branchless FP min and max on x86?

To quote (thanks to the author for developing and sharing the algorithm!):

https://tavianator.com/fast-branchless-raybounding-box-intersections/

Since modern floating-point instruction sets can compute min and max without branches

Corresponding code by the author is just

dmnsn_min(double a, double b)
{
  return a < b ? a : b;
}

I'm familiar with e.g. _mm_max_ps, but that's a vector instruction. The code above obviously is meant to be used in a scalar form.

Question:

  • What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?
  • Is it safe to assume it's going to be applied, or how do I call it?
  • Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray - box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?
  • Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable w.r.t the (unknown) instruction we're discussing and the floating-point standard?

Just in case: I'm familiar with Use of min and max functions in C++, believe it's related but not quite my question.

like image 915
iksemyonov Avatar asked Oct 22 '16 20:10

iksemyonov


2 Answers

Warning: Beware of compilers treating _mm_min_ps / _mm_max_ps (and _pd) intrinsics as commutative even in strict FP (not fast-math) mode; even though the asm instruction isn't. GCC specifically seems to have this bug: PR72867 which was fixed in GCC7 but may be back or never fixed for _mm_min_ss etc. scalar intrinsics (_mm_max_ss has different behavior between clang and gcc, GCC bugzilla PR99497).

GCC knows how the asm instructions themselves work, and doesn't have this problem when using them to implement strict FP semantics in plain scalar code, only with the C/C++ intrinsics.

Unfortunately there isn't a single instruction that implements fmin(a,b) (with guaranteed NaN propagation), so you have to choose between easy detection of problems vs. higher performance.


Most vector FP instructions have scalar equivalents. MINSS / MAXSS / MINSD / MAXSD are what you want. They handle +/-Infinity the way you'd expect.

MINSS a,b exactly implements (a<b) ? a : b according to IEEE rules, with everything that implies about signed-zero, NaN, and Infinities. (i.e. it keeps the source operand, b, on unordered.) This means C++ compilers can use them for std::min(b,a) and std::max(b,a), because those functions are based on the same expression. Note the b,a operand order for the std:: functions, opposite Intel-syntax for x86 asm, but matching AT&T syntax.

MAXSS a,b exactly implements (b<a) ? a : b, again keeping the source operand (b) on unordered. Like std::max(b,a).

Looping over an array with x = std::min(arr[i], x); (i.e. minss or maxss xmm0, [rsi]) will take a NaN from memory if one is present, and then take whatever non-NaN element is next because that compare will be unordered. So you'll get the min or max of the elements following the last NaN. You normally don't want this, so it's only good for arrays that don't contain NaN. But it means you can start with float v = NAN; outside a loop, instead of the first element or FLT_MAX or +Infinity, and might simplify handling possibly-empty lists. It's also convenient in asm, allowing init with pcmpeqd xmm0,xmm0 to generate an all-ones bit-pattern (a negative QNAN), but unfortunately GCC's NAN uses a different bit-pattern.

Demo/proof on the Godbolt compiler explorer, including showing that v = std::min(v, arr[i]); (or max) ignores NaNs in the array, at the cost of having to load into a register and then minss into that register.

(Note that min of an array should use vectors, not scalar; preferably with multiple accumulators to hide FP latency. At the end, reduce to one vector then do horizontal min of it, just like summing an array or doing a dot product.)


Don't try to use _mm_min_ss on scalar floats; the intrinsic is only available with __m128 operands, and Intel's intrinsics don't provide any way to get a scalar float into the low element of a __m128 without zeroing the high elements or somehow doing extra work. Most compilers will actually emit the useless instructions to do that even if the final result doesn't depend on anything in the upper elements. (Clang can often avoid it, though, applying the as-if rule to the contents of dead vector elements.) There's nothing like __m256 _mm256_castps128_ps256 (__m128 a) to just cast a float to a __m128 with garbage in the upper elements. I consider this a design flaw. :/

But fortunately you don't need to do this manually, compilers know how to use SSE/SSE2 min/max for you. Just write your C such that they can. The function in your question is ideal: as shown below (Godbolt link):

// can and does inline to a single MINSD instruction, and can auto-vectorize easily
static inline double
dmnsn_min(double a, double b) {
  return a < b ? a : b;
}

Note their asymmetric behaviour with NaN: if the operands are unordered, dest=src (i.e. it takes the second operand if either operand is NaN). This can be useful for SIMD conditional updates, see below.

(a and b are unordered if either of them is NaN. That means a<b, a==b, and a>b are all false. See Bruce Dawson's series of articles on floating point for lots of FP gotchas.)

The corresponding _mm_min_ss / _mm_min_ps intrinsics may or may not have this behaviour, depending on the compiler.

I think the intrinsics are supposed to have the same operand-order semantics as the asm instructions, but gcc has treated the operands to _mm_min_ps as commutative even without -ffast-math for a long time, gcc4.4 or maybe earlier. GCC 7 finally changed it to match ICC and clang.

Intel's online intrinsics finder doesn't document that behaviour for the function, but it's maybe not supposed to be exhaustive. The asm insn ref manual doesn't say the intrinsic doesn't have that property; it just lists _mm_min_ss as the intrinsic for MINSS.

When I googled on "_mm_min_ps" NaN, I found this real code and some other discussion of using the intrinsic to handle NaNs, so clearly many people expect the intrinsic to behave like the asm instruction. (This came up for some code I was writing yesterday, and I was already thinking of writing this up as a self-answered Q&A.)

Given the existence of this longstanding gcc bug, portable code that wants to take advantage of MINPS's NaN handling needs to take precautions. The standard gcc version on many existing Linux distros will mis-compile your code if it depends on the order of operands to _mm_min_ps. So you probably need an #ifdef to detect actual gcc (not clang etc), and an alternative. Or just do it differently in the first place :/ Perhaps with a _mm_cmplt_ps and boolean AND/ANDNOT/OR.

Enabling -ffast-math also makes _mm_min_ps commutative on all compilers.


As usual, compilers know how to use the instruction set to implement C semantics correctly. MINSS and MAXSS are faster than anything you could do with a branch anyway, so just write code that can compile to one of those.

The commutative-_mm_min_ps issue applies to only the intrinsic: gcc knows exactly how MINSS/MINPS work, and uses them to correctly implement strict FP semantics (when you don't use -ffast-math).

You don't usually need to do anything special to get decent scalar code out of a compiler. But if you are going to spend time caring about what instructions the compiler uses, you should probably start by manually vectorizing your code if the compiler isn't doing that.

(There may be rare cases where a branch is best, if the condition almost always goes one way and latency is more important than throughput. MINPS latency is ~3 cycles, but a perfectly predicted branch adds 0 cycles to the dependency chain of the critical path.)


In C++, use std::min and std::max, which are defined in terms of > or <, and don't have the same requirements on NaN behaviour that fmin and fmax do. Avoid fmin and fmax for performance unless you need their NaN behaviour.

In C, I think just write your own min and max functions (or macros if you do it safely).


C & asm on the Godbolt compiler explorer

float minfloat(float a, float b) {
  return (a<b) ? a : b;
}
# any decent compiler (gcc, clang, icc), without any -ffast-math or anything:
    minss   xmm0, xmm1
    ret

// C++
float minfloat_std(float a, float b) { return std::min(a,b); }
  # This implementation of std::min uses (b<a) : b : a;
  # So it can produce the result only in the register that b was in
  # This isn't worse (when inlined), just opposite
    minss   xmm1, xmm0
    movaps  xmm0, xmm1
    ret


float minfloat_fmin(float a, float b) { return fminf(a, b); }

# clang inlines fmin; other compilers just tailcall it.
minfloat_fmin(float, float):
    movaps  xmm2, xmm0
    cmpunordss      xmm2, xmm2
    movaps  xmm3, xmm2
    andps   xmm3, xmm1
    minss   xmm1, xmm0
    andnps  xmm2, xmm1
    orps    xmm2, xmm3
    movaps  xmm0, xmm2
    ret
   # Obviously you don't want this if you don't need it.

If you want to use _mm_min_ss / _mm_min_ps yourself, write code that lets the compiler make good asm even without -ffast-math.

If you don't expect NaNs, or want to handle them specially, write stuff like

lowest = _mm_min_ps(lowest, some_loop_variable);

so the register holding lowest can be updated in-place (even without AVX).


Taking advantage of MINPS's NaN behaviour:

Say your scalar code is something like

if(some condition)
    lowest = min(lowest, x);

Assume the condition can be vectorized with CMPPS, so you have a vector of elements with the bits all set or all clear. (Or maybe you can get away with ANDPS/ORPS/XORPS on floats directly, if you just care about their sign and don't care about negative zero. This creates a truth value in the sign bit, with garbage elsewhere. BLENDVPS looks at only the sign bit, so this can be super useful. Or you can broadcast the sign bit with PSRAD xmm, 31.)

The straight-forward way to implement this would be to blend x with +Inf based on the condition mask. Or do newval = min(lowest, x); and blend newval into lowest. (either BLENDVPS or AND/ANDNOT/OR).

But the trick is that all-one-bits is a NaN, and a bitwise OR will propagate it. So:

__m128 inverse_condition = _mm_cmplt_ps(foo, bar);
__m128 x = whatever;


x = _mm_or_ps(x, condition);   // turn elements into NaN where the mask is all-ones
lowest = _mm_min_ps(x, lowest);  // NaN elements in x mean no change in lowest
//  REQUIRES NON-COMMUTATIVE _mm_min_ps: no -ffast-math
//  AND DOESN'T WORK AT ALL WITH MOST GCC VERSIONS.

So with only SSE2, and we've done a conditional MINPS in two extra instructions (ORPS and MOVAPS, unless loop unrolling allows the MOVAPS to disappear).

The alternative without SSE4.1 BLENDVPS is ANDPS/ANDNPS/ORPS to blend, plus an extra MOVAPS. ORPS is more efficient than BLENDVPS anyway (it's 2 uops on most CPUs).

like image 134
Peter Cordes Avatar answered Nov 02 '22 16:11

Peter Cordes


Peter Cordes's answer is great, I just figured I'd jump in with some shorter point-by-point answers:

  • What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?

I was referring to minss/minsd. And even other architectures without such instructions should be able to do this branchlessly with conditional moves.

  • Is it safe to assume it's going to be applied, or how do I call it?

gcc and clang will both optimize (a < b) ? a : b to minss/minsd, so I don't bother using intrinsics. Can't speak to other compilers though.

  • Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray - box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?

The individual a < b tests are pretty much completely unpredictable, so it is very important to avoid branching for those. Tests like if (ray.dir.x != 0.0) are very predictable, so avoiding those branches is less important, but it does shrink the code size and make it easier to vectorize. The most important part is probably removing the divisions though.

  • Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable w.r.t the (unknown) instruction we're discussing and the floating-point standard?

Yes, minss/minsd behave exactly like (a < b) ? a : b, including their treatment of infinities and NaNs.

Also, I wrote a followup post to the one you referenced that talks about NaNs and min/max in more detail.

like image 24
Tavian Barnes Avatar answered Nov 02 '22 17:11

Tavian Barnes