 

Efficient way of rotating a byte inside an AVX register

Tags: c, avx, simd, sse, avx2

Summary/tl;dr: Is there any way to rotate a byte inside a YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together?

For each 8 bytes in a YMM register, I need to left-rotate 7 of the bytes. Each byte needs to be rotated one bit more to the left than the previous one: the first byte should be rotated 0 bits and the seventh should be rotated 6 bits.
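To make the requirement concrete, here is a plain-C scalar version of what I want per 8-byte group (reference only, not the SIMD code I am after):

#include <stdint.h>

/* Plain-C reference: rotate byte b left by k bits (k = 0..7). */
static inline uint8_t rotl8(uint8_t b, unsigned k)
{
    return (uint8_t)((b << k) | (b >> (8 - k)));   /* b promotes to int, so k = 0 is fine */
}

/* For each group of 8 bytes: byte i (i = 0..6) is rotated left by i bits,
   and the last byte of the group is left untouched. */
static void rotate_group(uint8_t group[8])
{
    for (unsigned i = 0; i < 7; i++)
        group[i] = rotl8(group[i], i);
}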

Currently, I have an implementation that does this (using the 1-bit rotate as an example) by shifting the register 1 bit to the left and 7 bits to the right separately. I then use the blend operation (intrinsic _mm256_blend_epi16) to choose the correct bits from the two temporary results to get my final rotated byte.
This costs a total of 2 shift operations and 1 blend operation per byte, and 6 bytes need to be rotated, thus 18 operations in total (shift and blend have just about the same performance).

There must be a faster way to do this than spending 18 operations to rotate just 6 bytes!

Furthermore, I need to assemble all the bytes afterwards into the new register. I do this by loading 7 masks with the "set" instruction into registers, so I can extract the correct byte from each register. I AND these masks with the registers to extract the correct byte from each of them, and then XOR the single-byte registers together to get the new register with all the bytes. This takes a total of 7+7+6 operations, so another 20 operations (per register).
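For illustration, the reassembly step looks roughly like this (rot[k] and mask[k] are hypothetical names for the rotated registers and the byte-selection masks I describe above):

#include <immintrin.h>

/* Hypothetical sketch of my reassembly: rot[k] holds the register after the
   k-bit rotate, mask[k] keeps only the byte at position k of each 8-byte group. */
static __m256i assemble(const __m256i rot[7], const __m256i mask[7])
{
    __m256i result = _mm256_and_si256(rot[0], mask[0]);
    for (int k = 1; k < 7; k++)
        result = _mm256_xor_si256(result, _mm256_and_si256(rot[k], mask[k]));
    return result;
}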

I could use the extract intrinsic (_mm256_extract_epi8) to get the single bytes, and then use _mm256_set_epi8 to assemble the new registers, but I don't know yet whether that would be faster. (There is no listed performance for these functions in the Intel intrinsics guide, so maybe I am misunderstanding something here.)

This gives a total of 38 operations per register, which seems less than optimal for rotating 6 bytes differently inside a register.

I hope someone more proficient in AVX/SIMD can tell me whether I am going about this the wrong way, as I feel I might be doing just that right now.

asked May 02 '16 by oPolo

2 Answers

The XOP instruction set does provide _mm_rot_epi8() (which is NOT Microsoft-specific; it is also available in GCC since 4.4 or earlier, and should be available in recent clang, too). It can be used to perform the desired task in 128-bit units. Unfortunately, I don't have a CPU with XOP support, so I cannot test that.
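Untested, but usage would presumably look something like this (GCC with -mxop; the rotation counts 0..7 for each 8-byte group are an assumption to match the question):

#include <x86intrin.h>   /* GCC: compile with -mxop */

/* Untested sketch: rotate byte i of each 8-byte group left by i bits. */
__m128i rotate_bytes_xop(__m128i value)
{
    const __m128i counts = _mm_set_epi8(7, 6, 5, 4, 3, 2, 1, 0,
                                        7, 6, 5, 4, 3, 2, 1, 0);
    return _mm_rot_epi8(value, counts);   /* VPROTB: per-byte left rotate */
}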

On AVX2, splitting the 256-bit register into two halves, one containing the even bytes and the other the odd bytes shifted right by 8 bits, allows a 16-bit vector multiply to do the trick. Given the constants (using GCC's 64-bit component array format)

static const __m256i epi16_highbyte = { 0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL };
static const __m256i epi16_lowbyte  = { 0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL };
static const __m256i epi16_oddmuls  = { 0x4040101004040101ULL,
                                        0x4040101004040101ULL,
                                        0x4040101004040101ULL,
                                        0x4040101004040101ULL };
static const __m256i epi16_evenmuls = { 0x8080202008080202ULL,
                                        0x8080202008080202ULL,
                                        0x8080202008080202ULL,
                                        0x8080202008080202ULL };

the rotation operation can be written as

__m256i byteshift(__m256i value)
{
    return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_lowbyte), epi16_oddmuls), 8),
                           _mm256_and_si256(_mm256_mullo_epi16(_mm256_and_si256(_mm256_srai_epi16(value, 8), epi16_lowbyte), epi16_evenmuls), epi16_highbyte));
}

This has been verified to yield correct results on Intel Core i5-4200U using GCC-4.8.4. As an example, the input vector (as a single 256-bit hexadecimal number)

88 87 86 85 84 83 82 81 38 37 36 35 34 33 32 31 28 27 26 25 24 23 22 21 FF FE FD FC FB FA F9 F8

gets rotated into

44 E1 D0 58 24 0E 05 81 1C CD C6 53 A1 CC 64 31 14 C9 C4 52 21 8C 44 21 FF BF BF CF DF EB F3 F8

where the leftmost octet is rotated left by 7 bits, the next by 6 bits, and so on; the eighth octet is unchanged, the ninth is again rotated by 7 bits, and so on, for all 32 octets.

I am not sure whether the above function definition compiles to optimal machine code (that depends on the compiler), but I'm certainly happy with its performance.
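If you want to check it on your own machine, a quick test harness along these lines can be used (assuming the constants and byteshift() above; compile with -mavx2; the test pattern is arbitrary):

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

/* Hypothetical harness: run a pattern through byteshift() and dump
   input and output as hex, most significant byte first. */
int main(void)
{
    uint8_t in[32], out[32];

    for (int i = 0; i < 32; i++)
        in[i] = (uint8_t)(0x81 + 7 * i);       /* arbitrary test pattern */

    const __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, byteshift(v));

    for (int i = 31; i >= 0; i--) printf("%02X ", in[i]);
    printf("\n");
    for (int i = 31; i >= 0; i--) printf("%02X ", out[i]);
    printf("\n");
    return 0;
}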

Since you probably dislike the above concise format for the function, here it is in procedural, expanded form:

static __m256i byteshift(__m256i value)
{
    __m256i low, high;
    high = _mm256_srai_epi16(value, 8);
    low = _mm256_and_si256(value, epi16_lowbyte);
    high = _mm256_and_si256(high, epi16_lowbyte);
    low = _mm256_mullo_epi16(low, epi16_oddmuls);
    high = _mm256_mullo_epi16(high, epi16_evenmuls);
    low = _mm256_srli_epi16(low, 8);
    high = _mm256_and_si256(high, epi16_highbyte);
    return _mm256_or_si256(low, high);
}

In a comment, Peter Cordes suggested replacing the srai+and with an srli, and possibly the final and+or with a blendv. The former makes a lot of sense, as it is purely an optimization, but the latter may not (yet, on current Intel CPUs!) actually be faster.

I tried some microbenchmarking, but was unable to get reliable results. I typically use the TSC on x86-64, and take the median of a few hundred thousand tests using inputs and outputs stored to an array.
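For reference, the timing I do looks roughly like this (a sketch only; in practice you want the function inlined, many repetitions, and the median of the per-pass counts):

#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Rough sketch: measure one pass over an array of inputs in TSC cycles.
   Call this a few hundred thousand times and keep the median. */
static uint64_t time_one_pass(__m256i (*fn)(__m256i),
                              const __m256i *src, __m256i *dst, size_t n)
{
    const uint64_t start = __rdtsc();
    for (size_t i = 0; i < n; i++)
        dst[i] = fn(src[i]);
    return __rdtsc() - start;
}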

I think it is most useful if I just list the variants here, so anyone requiring such a function can benchmark their real-world workloads and see whether there is any measurable difference.

I also agree with his suggestion to use odd and even instead of high and low, but note that since the first element in a vector is numbered element 0, the first element is even, the second odd, and so on.

#include <immintrin.h>

static const __m256i epi16_oddmask  = { 0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL };
static const __m256i epi16_evenmask = { 0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL };
static const __m256i epi16_evenmuls = { 0x4040101004040101ULL,
                                        0x4040101004040101ULL,
                                        0x4040101004040101ULL,
                                        0x4040101004040101ULL };
static const __m256i epi16_oddmuls  = { 0x8080202008080202ULL,
                                        0x8080202008080202ULL,
                                        0x8080202008080202ULL,
                                        0x8080202008080202ULL };

/* Original version suggested by Nominal Animal. */
__m256i original(__m256i value)
{
    return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
                           _mm256_and_si256(_mm256_mullo_epi16(_mm256_and_si256(_mm256_srai_epi16(value, 8), epi16_evenmask), epi16_oddmuls), epi16_oddmask));
}

/* Optimized as suggested by Peter Cordes, without blendv */
__m256i no_blendv(__m256i value)
{
    return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
                           _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(value, 8), epi16_oddmuls), epi16_oddmask));
}

/* Optimized as suggested by Peter Cordes, with blendv.
 * This is the recommended version. */
__m256i optimized(__m256i value)
{
    return _mm256_blendv_epi8(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
                              _mm256_mullo_epi16(_mm256_srli_epi16(value, 8), epi16_oddmuls), epi16_oddmask);
}

Here are the same functions written in a way that shows the individual operations. Although it does not affect sane compilers at all, I've marked the function parameter and each temporary value const, so that it is obvious how you can insert each into a subsequent expression, to simplify the functions to their above concise forms.

__m256i original_verbose(const __m256i value)
{
    const __m256i odd1  = _mm256_srai_epi16(value, 8);
    const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
    const __m256i odd2  = _mm256_and_si256(odd1, epi16_evenmask);
    const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
    const __m256i odd3  = _mm256_mullo_epi16(odd2, epi16_oddmuls);
    const __m256i even3 = _mm256_srli_epi16(even2, 8);
    const __m256i odd4  = _mm256_and_si256(odd3, epi16_oddmask);
    return _mm256_or_si256(even3, odd4);
}

__m256i no_blendv_verbose(const __m256i value)
{
    const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
    const __m256i odd1  = _mm256_srli_epi16(value, 8);
    const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
    const __m256i odd2  = _mm256_mullo_epi16(odd1, epi16_oddmuls);
    const __m256i even3 = _mm256_srli_epi16(even2, 8);
    const __m256i odd3  = _mm256_and_si256(odd2, epi16_oddmask);
    return _mm256_or_si256(even3, odd3);
}

__m256i optimized_verbose(const __m256i value)
{
    const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
    const __m256i odd1  = _mm256_srli_epi16(value, 8);
    const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
    const __m256i odd2  = _mm256_mullo_epi16(odd1, epi16_oddmuls);
    const __m256i even3 = _mm256_srli_epi16(even2, 8);
    return _mm256_blendv_epi8(even3, odd2, epi16_oddmask);
}

I personally write my test functions initially in their above verbose forms, as producing the concise version is a trivial bit of copy-pasting. I do, however, test both versions to verify that no errors were introduced, and I keep the verbose version accessible (as a comment or so), because the concise versions are basically write-only. It is much easier to edit the verbose version and then simplify it to the concise form than to try to edit the concise version.

answered Sep 25 '22 by Nominal Animal


[Based on the first comment and some edits, the resulting solution is a little different. I will present that first, then leave the original thought below]

The main idea here is using multiplication by powers of 2 to accomplish the shifting, since these constants can vary across the vector. @harold pointed out the next idea, which is that multiplying a byte duplicated into both halves of a 16-bit word automatically brings the shifted-out bits back around into the lower bits, accomplishing the rotation.

  1. Unpack and duplicate bytes into 16-bit values [... d c b a] -> [... dd cc bb aa]
  2. Generate a 16-bit constant [128 64 32 16 8 4 2 1]
  3. Multiply
  4. The byte you want is the top eight bits of each 16-bit value, so right-shift and repack

Assuming __m128i source (you only have 8 bytes, right?):

__m128i duped = _mm_unpacklo_epi8(src, src);                  /* byte b -> 16-bit value b*0x0101 */
__m128i res = _mm_mullo_epi16(duped, power_of_two_vector);    /* high byte of each lane = rotated b */
__m128i repacked = _mm_packus_epi16(_mm_srli_epi16(res, 8), _mm_setzero_si128());
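Wrapped up as a function, it might look like this (untested sketch; power_of_two_vector is the hypothetical constant from step 2, with byte i getting multiplier 1 << i, i.e. an i-bit rotate):

#include <emmintrin.h>   /* SSE2 */

__m128i rotate_bytes(__m128i src)
{
    /* Multiplier 2^i for byte i: _mm_set_epi16 takes the highest element first. */
    const __m128i power_of_two_vector = _mm_set_epi16(128, 64, 32, 16, 8, 4, 2, 1);

    __m128i duped    = _mm_unpacklo_epi8(src, src);                 /* b -> b*0x0101 */
    __m128i res      = _mm_mullo_epi16(duped, power_of_two_vector); /* high byte = rotl(b, i) */
    __m128i repacked = _mm_packus_epi16(_mm_srli_epi16(res, 8), _mm_setzero_si128());
    return repacked;   /* rotated bytes end up in the low 8 bytes of the result */
}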

[saving this original idea for comparison]

What about this: Use multiplication by powers of 2 to accomplish the shifts, using 16-bit products. Then OR the upper and lower halves of the product to accomplish the rotation.

  1. Unpack the bytes into 16-bit words.
  2. Generate a 16-bit constant [ 128 64 32 16 8 4 2 1 ]
  3. Multiply the 16-bit words
  4. Re-pack the 16-bit words into two eight-bit vectors, a high-byte vector and a low-byte vector
  5. OR those two vectors to accomplish the rotate.

I'm a little fuzzy on the available multiply options and your instruction set limitations, but the ideal would be an 8-bit by 8-bit multiply that produces 16-bit products. As far as I know that doesn't exist, which is why I suggest unpacking first, but I've seen other neat algorithms for doing this.
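A rough sketch of this original idea (untested; it ORs the high and low bytes of each 16-bit product in place before repacking once, which amounts to the same thing as steps 4 and 5):

#include <emmintrin.h>   /* SSE2 */

/* Untested sketch: zero-extend bytes, multiply by 2^i, then OR the high and
   low bytes of each product to get the rotated byte, and repack. */
__m128i rotate_bytes_or_variant(__m128i src)
{
    const __m128i powers = _mm_set_epi16(128, 64, 32, 16, 8, 4, 2, 1);

    __m128i words = _mm_unpacklo_epi8(src, _mm_setzero_si128());   /* b -> 0x00bb */
    __m128i prod  = _mm_mullo_epi16(words, powers);                /* b << i, as 16 bits */
    __m128i rot   = _mm_or_si128(_mm_and_si128(prod, _mm_set1_epi16(0x00FF)),
                                 _mm_srli_epi16(prod, 8));         /* (b << i) | (b >> (8-i)) */
    return _mm_packus_epi16(rot, _mm_setzero_si128());
}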

answered Sep 24 '22 by Peter