 

Shuffling by mask with Intel AVX

I'm new to AVX programming. I have a register that needs to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, into an empty register R2, and I want to define a mask that tells the shuffle operation which byte from the old register (R1) should be copied to which position in the new register.

The mask should look like this (Src: byte position in R1, Target: byte position in R2):

{(0,0),(1,1),(1,4),(2,5),...}

This means several bytes are copied twice.

I'm not 100% sure which function I should use for this. I experimented a bit with these two AVX functions; the second one just uses two lanes.

__m256 _mm256_permute_ps (__m256 a, int imm8)
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)

I'm totally confused about the Shuffle Mask in imm8 and how to design it so that it would work as described above.

I had a look at these slides (page 26) where _MM_SHUFFLE is described, but I can't find a solution to my problem.

Are there any tutorials on how to design such a mask? Or example functions for the two methods to understand them in depth?

Thanks in advance for any hints.

asked Apr 30 '18 by Thorgas

1 Answer

TL:DR: you probably either need multiple shuffles to handle lane-crossing, or if your pattern continues exactly like that you can use _mm256_cvtepu16_epi32 (vpmovzxwd) and then _mm256_blend_epi16.


For x86 shuffles (like most SIMD instruction-sets, I think), the destination position is implicit. A shuffle-control constant just has source indices in destination order, whether it's an imm8 that gets compiled+assembled right into an asm instruction or whether it's a vector with an index in each element.

Each destination position reads exactly one source position, but the same source position can be read more than once. Each destination element gets a value from the shuffle source.

See Convert _mm_shuffle_epi32 to C expression for the permutation? for a plain-C version of dst = _mm_shuffle_epi32(src, _MM_SHUFFLE(d,c,b,a)), showing how the control byte is used.
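
As a rough scalar sketch (not the intrinsic itself, just an illustration of how the control byte is decoded):

#include <stdint.h>

// Scalar model of dst = _mm_shuffle_epi32(src, imm8), where imm8 = _MM_SHUFFLE(d,c,b,a)
// packs four 2-bit source indices. Destination position i is implicit; bits [2i+1:2i]
// of imm8 pick which source element it reads.
static void shuffle_epi32_model(uint32_t dst[4], const uint32_t src[4], unsigned imm8)
{
    for (int i = 0; i < 4; i++) {
        unsigned src_idx = (imm8 >> (2 * i)) & 0x3;
        dst[i] = src[src_idx];
    }
}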

(For pshufb / _mm_shuffle_epi8, an element with the high bit set zeros that destination position instead of reading any source element, but other x86 shuffles ignore all the high bits in shuffle-control vectors.)

Without AVX512 merge-masking, there are no shuffles that also blend into a destination. There are some two-source shuffles like _mm256_shuffle_ps (vshufps) which can shuffle together elements from two sources to produce a single result vector. If you want to leave some destination elements unmodified, you'll probably have to shuffle and then blend, e.g. with _mm256_blendv_epi8. If you can blend with 16-bit granularity, the immediate blend _mm256_blend_epi16 is more efficient, and _mm256_blend_epi32 is even better: AVX2 vpblendd is as cheap as _mm256_and_si256 on Intel CPUs, and is the best choice whenever it can get the job done (see http://agner.org/optimize/).
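
For instance, if the bytes you want to keep happened to line up on 32-bit boundaries, the shuffle-then-blend step could use an immediate dword blend; the alternating 0x55 pattern below is only an assumption for illustration:

#include <immintrin.h>

// Merge a shuffled source into dst with vpblendd (_mm256_blend_epi32).
// Immediate bit i set = take dword i from the second operand.
// 0x55 = 0b01010101: even dwords from `shuffled`, odd dwords keep `dst`.
static inline __m256i merge_even_dwords(__m256i dst, __m256i shuffled)
{
    return _mm256_blend_epi32(dst, shuffled, 0x55);
}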


For your problem (without AVX512VBMI vpermb in Cannonlake), you can't shuffle single bytes from the low 16 "lane" into the high 16 "lane" of a __m256i vector with a single operation.

AVX shuffles are not like a full 256-bit SIMD, they're more like two 128-bit operations in parallel. The only exceptions are some AVX2 lane-crossing shuffles with 32-bit granularity or larger, like vpermd (_mm256_permutevar8x32_epi32). And also the AVX2 versions of pmovzx / pmovsx, e.g. pmovzxbq does zero-extend the low 4 bytes of an XMM register into the 4 qwords of a YMM register, rather than the low 2 bytes of each half of a YMM register. This makes it much more useful with a memory source operand.
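
As a quick illustration of one of those lane-crossing exceptions, here is a sketch using vpermd; the element-reversal pattern is arbitrary, chosen only to show indices reaching across the 128-bit boundary:

#include <immintrin.h>

// Reverse the eight 32-bit elements of a vector with vpermd
// (_mm256_permutevar8x32_epi32): result element i = v[ idx[i] ],
// and the indices 0..7 may cross the 128-bit lane boundary.
static inline __m256i reverse_dwords(__m256i v)
{
    const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    return _mm256_permutevar8x32_epi32(v, idx);
}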

But anyway, the AVX2 version of pshufb (_mm256_shuffle_epi8) does two separate 16x16 byte shuffles in the two lanes of a 256-bit vector.


You're probably going to want something like this:

// Intrinsics have different types for integer, float, and double vectors;
// the asm uses the same registers either way.
__m256i shuffle_and_blend(__m256i dst, __m256i src)
{
    // setr takes elements in low-to-high order, like a C array initializer,
    // unlike the standard Intel notation where the high element comes first.
    const __m256i shuffle_control = _mm256_setr_epi8(
          0,      1,  -1, -1,   1,      2, ...);
    // {(0,0),  (1,1), (zero)  (1,4), (2,5),...}  in your src,dst notation
    // Use -1 or 0x80 or anything with the high bit set
    // for positions you want to leave unmodified in dst.
    // blendv uses the high bit as a blend control, so the same vector can do double duty.

    // maybe need some lane-crossing stuff depending on the pattern of your shuffle.
    __m256i shuffled = _mm256_shuffle_epi8(src, shuffle_control);

    // or if the pattern continues, and you're just leaving 2 bytes between every 2-byte group:
    shuffled = _mm256_cvtepu16_epi32(_mm256_castsi256_si128(src));  // zero-extend the low 8 words of src

    // blend dst elements we want to keep into the shuffled src result.
    __m256i blended = _mm256_blendv_epi8(shuffled, dst, shuffle_control);
    return blended;
}

Note that the pshufb numbering restarts from 0 for the 2nd 16 bytes. The two halves of the __m256i can be different, but they can't read elements from the other half. If you need positions in the high lane to get bytes from the low lane, you'll need more shuffling + blending (e.g. including vinserti128 or vperm2i128, or maybe a vpermd lane-crossing dword shuffle) to get all the bytes you need into one 16-byte group in some order.
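
For example, if every byte you needed in the high lane happened to come from the low 16 bytes of src, one possible workaround (a sketch under that assumption, not a general solution) is to broadcast the low lane into both halves first and then do the in-lane byte shuffle:

#include <immintrin.h>

// Make the low 128-bit lane of src available to both halves of the byte shuffle,
// then shuffle within each lane as usual.
static inline __m256i shuffle_bytes_from_low_lane(__m256i src, __m256i shuffle_control)
{
    // vperm2i128 with control 0x00: both output lanes = low lane of src
    __m256i low_in_both = _mm256_permute2x128_si256(src, src, 0x00);
    return _mm256_shuffle_epi8(low_in_both, shuffle_control);
}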

(Actually _mm256_shuffle_epi8 (PSHUFB) ignores bits 4..6 in a shuffle index, so writing 17 is the same as 1, but very misleading. It's effectively doing a %16, as long as the high bit isn't set. If the high bit is set in the shuffle-control vector, it zeros that element. We don't need that functionality here; _mm256_blendv_epi8 doesn't care about the old value of the element it's replacing)

Anyway, this simple 2-instruction example only works if the pattern doesn't continue. If you want help designing your real shuffles, you'll have to ask a more specific question.


And BTW, I notice that your blend pattern copies 2 new bytes and then skips 2. If that continues, you could use vpblendw _mm256_blend_epi16 instead of blendv, because that instruction runs in only 1 uop instead of 2 on Intel CPUs. It would also allow you to use AVX512BW vpermw, a 16-bit shuffle available in current Skylake-AVX512 CPUs, instead of the probably-even-slower AVX512VBMI vpermb.

Or actually, it would maybe let you use vpmovzxwd (_mm256_cvtepu16_epi32) to zero-extend 16-bit elements to 32-bit, as a lane-crossing shuffle. Then blend with dst.
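
A minimal sketch of that idea, assuming the pattern really does continue as "copy a 2-byte pair, skip 2 bytes" and that the source words fit in a 128-bit register:

#include <immintrin.h>

// Zero-extend eight 16-bit elements (vpmovzxwd) so each lands in the low half of a
// 32-bit slot, then keep dst in the skipped halves with an immediate word blend
// (vpblendw). 0xAA = 0b10101010 per lane: odd words from dst, even words from the
// zero-extended source.
static inline __m256i spread_words_and_blend(__m256i dst, __m128i src_words)
{
    __m256i spread = _mm256_cvtepu16_epi32(src_words);
    return _mm256_blend_epi16(spread, dst, 0xAA);
}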

answered Oct 04 '22 by Peter Cordes