Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is permute needed in parallel SIMD/SSE/AVX ?

From my other question about "Using SIMD AVX SSE for tree traversal" ive got this code that im trying to benchmark. I havent done anything with SIMD before so I'm kinda new to this permutation stuff. First, lets see this code:

__m256i const perm_mask = _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0);

// compare the two halves of the cache line.
__m256i cmp1 = _mm256_load_si256(&node->m256[0]);
__m256i cmp2 = _mm256_load_si256(&node->m256[1]);

cmp1 = _mm256_cmpgt_epi32(cmp1, value); // PCMPGTD
cmp2 = _mm256_cmpgt_epi32(cmp2, value); // PCMPGTD

// merge the comparisons back together.
//
// a permute is required to get the pack results back into order
// because AVX-256 introduced that unfortunate two-lane interleave.
//
// alternately, you could pre-process your data to remove the need
// for the permute.
__m256i cmp = _mm256_packs_epi32(cmp1, cmp2); // PACKSSDW
cmp = _mm256_permutevar8x32_epi32(cmp, perm_mask); // PERMD

// finally create a move mask and count trailing
// zeroes to get an index to the next node.

unsigned mask = _mm256_movemask_epi8(cmp); // PMOVMSKB
return _tzcnt_u32(mask) / 2; // TZCNT

The author, Cory Nelson tried to explain it with the comments. However, I'm not really getting how this permutations work and why it does end up to "extract" the wanted information from the result vector.

Could anybody help me out to understand how the permutation, movemask an TZCNT is used in this code and what "packing/unpacking" means in this context ? I'd be thankfull for any resources you might have about it - google aint that helpfull with this very special topic.

like image 300
user1610743 Avatar asked Jan 04 '14 09:01

user1610743


1 Answers

Intel's instruction set manuals will be invaluable to your learning of SIMD. It explains in great detail what each of those instructions is doing.

"Packing" in SSE/AVX is basically a downcast and merge of two registers. PACKSSDW packs 32-bit signed ints from two registers into 16-bit signed ints in one register, and saturates the values (so values < -32768 will be set to -32768, and >32767 will be set to 32767)

A permute is a way of reordering the values in a register. Each value in the mask register specifies an index into the source. This is required because AVX256 "cheated" a little and processes most of its mixing instructions as two 128-bit "lanes".

The 128-bit version of PACKSSDW performs this:

r0 := SignedSaturate(a0)
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(b0)
r5 := SignedSaturate(b1)
r6 := SignedSaturate(b2)
r7 := SignedSaturate(b3)

You'd expect the 256-bit version to maintain the same natural ordering with all the "A"s first and the "B"s second, like this:

r0 := SignedSaturate(a0)
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(a4)
r5 := SignedSaturate(a5)
r6 := SignedSaturate(a6)
r7 := SignedSaturate(a7)
r8 := SignedSaturate(b0)
r9 := SignedSaturate(b1)
r10 := SignedSaturate(b2)
r11 := SignedSaturate(b3)
r12 := SignedSaturate(b4)
r13 := SignedSaturate(b5)
r14 := SignedSaturate(b6)
r15 := SignedSaturate(b7)

But instead, what it actually does this:

r0 := SignedSaturate(a0) // lane one, the low 128 bits.
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(b0)
r5 := SignedSaturate(b1)
r6 := SignedSaturate(b2)
r7 := SignedSaturate(b3)
r8 := SignedSaturate(a4) // lane two, the high 128 bits.
r9 := SignedSaturate(a5)
r10 := SignedSaturate(a6)
r11 := SignedSaturate(a7)
r12 := SignedSaturate(b4)
r13 := SignedSaturate(b5)
r14 := SignedSaturate(b6)
r15 := SignedSaturate(b7)

The result is that when comparing an array of neatly ordered values, the 128-bit version keeps them ordered while the 256-bit version will mix them. The permute puts them back into order.

As I alluded to in my post, you can get rid of the permute in this code by preprocessing your node's array to have the inverse, so that the "mixed" results of the 256-bit op puts it back in order:

void preprocess_avx2(bnode* const node)
{
    __m256i const perm_mask = _mm256_set_epi32(3, 2, 1, 0, 7, 6, 5, 4);
    __m256i *const middle = (__m256i*)&node->i32[4];

    __m256i x = _mm256_loadu_si256(middle);
    x = _mm256_permutevar8x32_epi32(x, perm_mask);
    _mm256_storeu_si256(middle, x);
}

The ordering is important because of what it does next.

The compare works on 16 32-bit values, but it results in either 0x0000 or 0xFFFF for all of them. You essentially only have 16 bits of information -- off or on for each value. PMOVMSKB treats the input as 32 8-byte values and packs the high bits of each (which is all we need, since all the bits are the same) into a 32-bit int.

TZCNT counts the trailing zero bits in that int, which gives an index to the first position that has a set bit: the index of the first byte in that SIMD register that compared as greater-than.

(Fun fact: TZCNT is a Haswell improvement over the existing BSF instruction, and in fact shares an encoding with it. The only difference is that TZCNT has a defined register output when its input is 0 -- with BSF you'd need to branch.)

like image 76
Cory Nelson Avatar answered Oct 03 '22 08:10

Cory Nelson