Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Writing a portable SSE/AVX version of std::copysign

I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires to select the sign of a value opposite/equal to another value. In the serial version, I used std::copysign for this. Now I want to create a similar function for SSE/AVX registers. Unfortunately, the STL uses a built-in function for that, so I can't just copy the code and turn it into SSE/AVX instructions.

I have not tried it yet (so I have no code to show for now), but my simple approach would be to create a register with all values set to -0.0 so that only the signed bit is set. Then I would use an AND operation on the source to find out if its sign is set or not. The result of this operation would either be 0.0 or -0.0, depending on the sign of the source. With the result, I would create a bitmask (using logic operations) which I can combine with the target register (using another logic operation) to set the sign accordingly.

However, I am not sure if there isn't a smarter way to solve this. If there is a built-in function for fundamental data types like floats and doubles, maybe there is also an intrinsic that I missed. Any suggestions?

Thanks in advance

EDIT:

Thanks to "chtz" for this useful link:

https://godbolt.org/z/oY0f7c

So basically std::copysign compiles to a sequence of 2 AND operations and a subsequent OR. I will reproduce this for SSE/AVX and post the result here in case somebody else needs it some day :)

EDIT 2:

Here is my working version:

__m128 CopySign(__m128 srcSign, __m128 srcValue)
{
    // Extract the signed bit from srcSign
    const __m128 mask0 = _mm_set1_ps(-0.);
    __m128 tmp0 = _mm_and_ps(srcSign, mask0);

    // Extract the number without sign of srcValue (abs(srcValue))
    __m128 tmp1 = _mm_andnot_ps(mask0, srcValue);

    // Merge signed bit with number and return
    return _mm_or_ps(tmp0, tmp1);
}

Tested it with:

__m128 a = _mm_setr_ps(1, -1, -1, 1);
__m128 b = _mm_setr_ps(-5, -11, 3, 4);

__m128 c = CopySign(a, b);

for (U32 i = 0; i < 4; ++i)
    std::cout << simd::GetValue(c, i) << std::endl;

The output is as expected:

5
-11
-3
4

However, I also tried the version from the disassembly where

__m128 tmp1 = _mm_andnot_ps(mask0, srcValue);

is replaced with:

const __m128 mask1 = _mm_set1_ps(NAN);
__m128 tmp1 = _mm_and_ps(srcValue, mask1);

The results are quite strange:

4
-8
-3
4

Depending on the chosen numbers, the number is sometimes okay and sometimes not. The sign is always correct. It seems like NaN is not !(-0.0) for some reason. I remember that I had some issues before when I tried to set register values to NaN or specific bit patterns. Maybe somebody has an idea about the origin of the problem?

EDIT 3:

As 'Maxim Egorushkin' clarified in the comments of his answer, my expectation about NaN being !(-0.0) is wrong. NaN seems not to be a unique bit pattern (see https://steve.hollasch.net/cgindex/coding/ieeefloat.html).

Thank you very much to all of you!

like image 528
wychmaster Avatar asked Sep 10 '19 12:09

wychmaster


2 Answers

AVX versions for float and double:

#include <immintrin.h>

__m256 copysign_ps(__m256 from, __m256 to) {
    constexpr float signbit = -0.f;
    auto const avx_signbit = _mm256_broadcast_ss(&signbit);
    return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

__m256d copysign_pd(__m256d from, __m256d to) {
    constexpr double signbit = -0.;
    auto const avx_signbit = _mm256_broadcast_sd(&signbit);
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

assembly

The Intel Intrinsics Guide


With AVX2 avx_signbit can be generated with no constants:

__m256 copysign2_ps(__m256 from, __m256 to) {
    auto a = _mm256_castps_si256(from);
    auto avx_signbit = _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_cmpeq_epi32(a, a), 31));
    return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

__m256d copysign2_pd(__m256d from, __m256d to) {
    auto a = _mm256_castpd_si256(from);
    auto avx_signbit = _mm256_castsi256_pd(_mm256_slli_epi64(_mm256_cmpeq_epi64(a, a), 63));
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

Still though, both clang and gcc calculate avx_signbit at compile time and replace it with constants loaded from .rodata section, which is, IMO, sub-optimal.

like image 152
Maxim Egorushkin Avatar answered Nov 15 '22 22:11

Maxim Egorushkin


Here's a version that I think is slightly better than the accepted answer if you target icc:

__m256d copysign_pd(__m256d from, __m256d to) {
    __m256d const avx_sigbit = _mm256_set1_pd(-0.);
    return _mm256_or_pd(_mm256_and_pd(avx_sigbit, from), _mm256_andnot_pd(avx_sigbit, to));
}

It uses _mm256_set1_pd rather than an broadcast intrinsic. On clang and gcc this is mostly a wash, but on icc the broadcast version actually writes a constant to the stack and then broadcasts from it, which is ... terrible.

Godbolt showing AVX-512 code, adjust the -march= to -march=skylake to see AVX2 code.

Here's an untested AVX-512 version which uses vpterlogdq directly, which compiles down to a single vpterlogd instruction on icc and clang (gcc includes a separate broadcast):

__m512d copysign_pd_alt(__m512d from, __m512d to) {
    const __m512i sigbit = _mm512_castpd_si512(_mm512_set1_pd(-0.));
    return _mm512_castsi512_pd(_mm512_ternarylogic_epi64(_mm512_castpd_si512(from), _mm512_castpd_si512(to), sigbit, 0xE4));
}

You could make a 256-bit version of this for when AVX-512 is enabled but you're dealing with __m256* vectors.

like image 32
BeeOnRope Avatar answered Nov 15 '22 22:11

BeeOnRope