
How to efficiently perform int8/int64 conversion with SSE?

I'm implementing conversions between SSE types and I found that implementing int8->int64 widening conversion for pre-SSE4.1 targets is cumbersome.

The straightforward implementation would be:

inline __m128i convert_i8_i64(__m128i a)
{
#ifdef __SSE4_1__
    return _mm_cvtepi8_epi64(a);
#else
    a = _mm_unpacklo_epi8(a, a);
    a = _mm_unpacklo_epi16(a, a);
    a = _mm_unpacklo_epi32(a, a);
    return _mm_srai_epi64(a, 56); // missing intrinsic!
#endif
}

But since _mm_srai_epi64 doesn't exist until AVX-512, there are two options at this point:

  • implementing _mm_srai_epi64, or
  • implementing convert_i8_i64 in a different way.

I'm not sure which one would be the most efficient solution. Any idea?

plasmacel asked Dec 26 '16 19:12


2 Answers

The unpacking intrinsics are used here in a funny way. They "duplicate" the data instead of sign-extending it, as one would expect. For example, before the first unpack your register holds the following (highest byte on the left; only the two low bytes, a and b, matter):

x x x x x x x x x x x x x x a b

If you convert a and b to 16 bits, you should get this:

x x x x x x x x x x x x A a B b

Here A and B are sign-extensions of a and b, that is, both of them are either 0 or -1.

Instead of this, your code gives

x x x x x x x x x x x x a a b b

You then turn this into the proper result with an arithmetic right shift at the end.

However, you are not obliged to use the same operand twice in the "unpack" intrinsics. You could get the desired result if you "unpacked" the following two registers:

x x x x x x x x x x x x x x a b
x x x x x x x x x x x x x x A B

That is:

a = _mm_unpacklo_epi8(a, _mm_srai_epi8(a, 8));

(if that _mm_srai_epi8 intrinsic actually existed; SSE has no 8-bit shifts)
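
Since there is no 8-bit shift, one way to build the "A B" register with plain SSE2 is a signed compare against zero, which yields 0xFF for negative bytes and 0x00 otherwise. The following is a minimal sketch of that substitution, not part of the original answer; the function name widen_i8_i16_sse2 is made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sketch: sign-extend the low 8 bytes of v to eight 16-bit lanes.
 * widen_i8_i16_sse2 is a hypothetical helper name. */
static inline __m128i widen_i8_i16_sse2(__m128i v)
{
    /* 0 > v per byte: 0xFF where the byte is negative, 0x00 otherwise.
     * This plays the role of the "A B" register. */
    __m128i sign = _mm_cmpgt_epi8(_mm_setzero_si128(), v);

    /* Interleave: each value byte becomes the low half of a 16-bit lane,
     * its sign byte the high half. */
    return _mm_unpacklo_epi8(v, sign);
}
```

The same pattern should carry over to the wider stages with _mm_cmpgt_epi16 and _mm_cmpgt_epi32, at the cost of one compare per stage.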


You can apply the same idea to the last stage of your conversion. You want to "unpack" the following two registers:

x x x x x x x x A A A a B B B b
x x x x x x x x A A A A B B B B

To get them, arithmetically right-shift the 32-bit lanes of the register left by your second unpack (a a a a b b b b in the low 8 bytes):

_mm_srai_epi32(a, 24)
_mm_srai_epi32(a, 32)

So the last "unpack" is

_mm_unpacklo_epi32(_mm_srai_epi32(a, 24), _mm_srai_epi32(a, 32));
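
Put together, the pre-SSE4.1 branch might look like the sketch below. It keeps the question's first two "duplicating" unpacks and replaces the last unpack plus the missing 64-bit shift with the two 32-bit shifts described above. The function name convert_i8_i64_sse2 is an assumption, and the shift count 31 is used in place of 32; both fill every bit of the lane with the sign bit, because psrad treats any count above 31 that way.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sketch: sign-extend the two lowest bytes of a to two 64-bit lanes
 * using only SSE2. convert_i8_i64_sse2 is a hypothetical name. */
static inline __m128i convert_i8_i64_sse2(__m128i a)
{
    a = _mm_unpacklo_epi8(a, a);    /* low 4 bytes:  a a b b         */
    a = _mm_unpacklo_epi16(a, a);   /* low 8 bytes:  a a a a b b b b */

    __m128i lo = _mm_srai_epi32(a, 24);  /* A A A a | B B B b */
    __m128i hi = _mm_srai_epi32(a, 31);  /* A A A A | B B B B */

    /* Interleave 32-bit lanes: each 64-bit result is (sign word : value word). */
    return _mm_unpacklo_epi32(lo, hi);
}
```

This takes two unpacks, two shifts, and one final unpack, and needs no emulated _mm_srai_epi64.
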
anatolyg answered Sep 21 '22 03:09


With SSSE3, you could use pshufb to avoid most of the unpacks. Using anatolyg's a / A notation:

;; input in xmm0                ;; x x x x  x x x x | x x x x  x x a b
pshufb   xmm0, [low_to_upper]   ;; a 0 0 0  0 0 0 0 | b 0 0 0  0 0 0 0
psrad    xmm0, 24               ;; A A A a  0 0 0 0 | B B B b  0 0 0 0
pshufb   xmm0, [bcast_signextend]; A A A A  A A A a | B B B B  B B B b
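
The same sequence can be sketched with intrinsics. The answer does not give the contents of the two shuffle masks, so the constants below are a reconstruction from the byte diagrams in the comments and should be treated as an assumption. The function name and the GCC/Clang target attribute (used so the sketch compiles without -mssse3) are also not from the answer.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pulls in the SSE2 header too) */

/* Sketch: int8 -> int64 sign extension of the two lowest bytes via pshufb.
 * convert_i8_i64_ssse3 is a hypothetical name; the target attribute is a
 * GCC/Clang extension. */
__attribute__((target("ssse3")))
static __m128i convert_i8_i64_ssse3(__m128i v)
{
    const char Z = (char)0x80;  /* pshufb selector with the top bit set: output 0 */

    /* Arguments run from byte 0 (lowest) to byte 15.
     * Move source byte 0 to byte 7 and source byte 1 to byte 15. */
    const __m128i low_to_upper =
        _mm_setr_epi8(Z, Z, Z, Z, Z, Z, Z, 0,  Z, Z, Z, Z, Z, Z, Z, 1);

    /* After the shift, bytes 4 and 12 hold the values and bytes 5 and 13
     * hold pure sign bytes; broadcast the sign byte over each upper part. */
    const __m128i bcast_signextend =
        _mm_setr_epi8(4, 5, 5, 5, 5, 5, 5, 5,  12, 13, 13, 13, 13, 13, 13, 13);

    v = _mm_shuffle_epi8(v, low_to_upper);   /* a 0 0 0 0 0 0 0 | b 0 0 0 0 0 0 0 */
    v = _mm_srai_epi32(v, 24);               /* A A A a 0 0 0 0 | B B B b 0 0 0 0 */
    return _mm_shuffle_epi8(v, bcast_signextend);
}
```

Each pshufb here reads a 16-byte constant, so this trades the unpack instructions for two mask loads.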

Without SSSE3, I think you might be able to do something with PSHUFLW, PSHUFD, and maybe POR instead of some of the PUNPCK steps. But nothing I've thought of is actually better than the unpacks unless you're on a Core2 or other slow-shuffle CPU where pshuflw is faster than punpcklbw.

Peter Cordes answered Sep 20 '22 03:09