SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

Question

I am looking for efficient AVX (AVX512) implementation of

// Given
float u[8];
float v[8];

// Compute
float a[8];
float b[8];

//  Such that
for ( int i = 0; i < 8; ++i )
{
    a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
    b[i] = fabs(u[i]) <  fabs(v[i]) ? u[i] : v[i];
}

I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.

Jason R · Accepted Answer

I had this exact same problem just the other day. The solution I came up with (using AVX only) was:

// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m256 a = _mm256_blendv_ps(u, v, u_ge_v);
__m256 b = _mm256_blendv_ps(v, u, u_ge_v);

The AVX512 equivalent would be:

// take the absolute value of u and v
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m512 a = _mm512_mask_blend_ps(u_ge_v, u, v);
__m512 b = _mm512_mask_blend_ps(u_ge_v, v, u);

As Peter Cordes suggested in the comments above, there are other approaches as well like taking the absolute value followed by a min/max and then reinserting the sign bit, but I couldn't find anything that was shorter/lower latency than this sequence of instructions.

Actually, there is another approach using AVX512DQ's VRANGEPS via the _mm512_range_ps() intrinsic. Intel's intrinsic guide describes it as follows:

Calculate the max, min, absolute max, or absolute min (depending on control in imm8) for packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst. imm8[1:0] specifies the operation control: 00 = min, 01 = max, 10 = absolute max, 11 = absolute min. imm8[3:2] specifies the sign control: 00 = sign from a, 01 = sign from compare result, 10 = clear sign bit, 11 = set sign bit.

Note that there appears to be a typo in the above; actually imm8[3:2] == 10 is "absolute min" and imm8[3:2] == 11 is "absolute max" if you look at the details of the per-element operation:

CASE opCtl[1:0] OF
0: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src1[31:0] : src2[31:0]
1: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src2[31:0] : src1[31:0]
2: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src1[31:0] : src2[31:0]
3: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src2[31:0] : src1[31:0]
ESAC

CASE signSelCtl[1:0] OF
0: dst[31:0] := (src1[31] << 31) OR (tmp[30:0])
1: dst[31:0] := tmp[63:0]
2: dst[31:0] := (0 << 31) OR (tmp[30:0])
3: dst[31:0] := (1 << 31) OR (tmp[30:0])
ESAC

RETURN dst

So you can get the same result with just two instructions:

auto a = _mm512_range_ps(v, u, 0x7);  // 0b0111 = sign from compare result, absolute max
auto b = _mm512_range_ps(v, u, 0x6);  // 0b0110 = sign from compare result, absolute min

The argument order (v, u) is a bit unintuitive, but it's needed in order to get the same behavior that you described in the OP in the event that the elements have equal absolute value (namely, that the value from u is passed through to a, and v goes to b).

On Skylake and Ice Lake Xeon platforms (probably any of the Xeons that have dual FMA units, probably?), VRANGEPS has throughput 2, so the two checks can issue and execute simultaneously, with latency of 4 cycles. This is only a modest latency improvement on the original approach, but the throughput is better and it requires fewer instructions/uops/instruction cache space.

Peter Cordes · Answer

clang does a pretty reasonable job of auto-vectorizing it with -ffast-math and the necessary __restrict qualifiers: https://godbolt.org/z/NMvN1u. and both inputs to ABS them, compare once, vblendvps twice on the original inputs with the same mask but the other sources in the opposite order to get min and max.

That's pretty much what I was thinking before checking what compilers did, and looking at their output to firm up the details I hadn't thought through yet. I don't see anything more clever than that. I don't think we can avoid abs()ing both a and b separately; there's no cmpps compare predicate that compares magnitudes and ignores the sign bit.

// untested: I *might* have reversed min/max, but I think this is right.
#include <immintrin.h>
// returns min_abs
__m256 minmax_abs(__m256 u, __m256 v,  __m256 *max_result) {
    const __m256 signbits = _mm256_set1_ps(-0.0f);
    __m256 abs_u = _mm256_andnot_ps(signbits, u);
    __m256 abs_v = _mm256_andnot_ps(signbits, v);  // strip the sign bit

    __m256 maxabs_is_v = _mm256_cmp_ps(abs_u, abs_v, _CMP_LT_OS);  // u < v

    *max_result = _mm256_blendv_ps(v, u, maxabs_is_v);
    return        _mm256_blendv_ps(u, v, maxabs_is_v);
}

You'd do the same thing with AVX512 except you compare into a mask instead of another vector.

// returns min_abs
__m512 minmax_abs512(__m512 u, __m512 v,  __m512 *max_result) {
    const __m512 absmask = _mm512_castsi512_ps(_mm512_set1_epi32(0x7fffffff));
    __m512 abs_u = _mm512_and_ps(absmask, u);
    __m512 abs_v = _mm512_and_ps(absmask, v);  // strip the sign bit

    __mmask16 maxabs_is_v = _mm512_cmp_ps_mask(abs_u, abs_v, _CMP_LT_OS);  // u < v

    *max_result = _mm512_mask_blend_ps(maxabs_is_v, v, u);
    return        _mm512_mask_blend_ps(maxabs_is_v, u, v);
}

Clang compiles the return statement in an interesting way (Godbolt):

.LCPI2_0:
    .long   2147483647              # 0x7fffffff
minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*):           # @minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*)
    vbroadcastss    zmm2, dword ptr [rip + .LCPI2_0]
    vandps  zmm3, zmm0, zmm2
    vandps  zmm2, zmm1, zmm2
    vcmpltps        k1, zmm3, zmm2
    vblendmps       zmm2 {k1}, zmm1, zmm0
    vmovaps zmmword ptr [rdi], zmm2   ## store the blend result
    vmovaps zmm0 {k1}, zmm1           ## interesting choice: blend merge-masking
    ret

Instead of using another vblendmps, clang notices that zmm0 already has one of the blend inputs, and uses merge-masking with a regular vector vmovaps. This has zero advantage of Skylake-AVX512 for 512-bit vblendmps (both single-uop instructions for port 0 or 5), but if Agner Fog's instruction tables are right, vblendmps x/y/zmm only ever runs on port 0 or 5, but a masked 256-bit or 128-bit vmovaps x/ymm{k}, x/ymm can run on any of p0/p1/p5.

Both are single-uop / single-cycle latency, unlike AVX2 vblendvps based on a mask vector which is 2 uops. (So AVX512 is an advantage even for 256-bit vectors). Unfortunately, none of gcc, clang, or ICC turn the _mm256_cmp_ps into _mm256_cmp_ps_mask and optimize the AVX2 intrinsics to AVX512 instructions when compiling with -march=skylake-avx512.)

s/512/256/ to make a version of minmax_abs512 that uses AVX512 for 256-bit vectors.

Gcc goes even further, and does the questionable "optimization" of

    vmovaps zmm2, zmm1        # tmp118, v
    vmovaps zmm2{k1}, zmm0    # tmp118, tmp114, tmp118, u

instead of using one blend instruction. (I keep thinking I'm seeing a store followed by a masked store, but no, neither compiler is blending that way).

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

Tags:

avx

sse

intrinsics

avx512

user2052436

2 Answers

Jason R

Peter Cordes

Recent Activity

Donate For Us

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

Tags:

avx

sse

intrinsics

avx512

user2052436

2 Answers

Jason R

Peter Cordes

Related questions

Recent Activity

Donate For Us