Efficiently gather individual bytes, separated by a byte-stride of 4

Tags: c, avx, intrinsics

I'm trying to optimize an algorithm that will process massive datasets and could strongly benefit from AVX SIMD instructions. Unfortunately, the input memory layout is not optimal for the required computations. Information must be reordered by assembling __m256i values from individual bytes that are exactly 4 bytes apart:

BEGIN EDIT

My target CPUs do not support AVX2 instructions, so, as @Elalfer and @PeterCordes pointed out, I can't make use of __m256i values; the code must be converted to use __m128i values instead.

END EDIT

DataSet layout in memory


Byte 0   | Byte 1   | Byte 2   | Byte 3
Byte 4   | Byte 5   | Byte 6   | Byte 7
...
Byte 120 | Byte 121 | Byte 122 | Byte 123
Byte 124 | Byte 125 | Byte 126 | Byte 127

Desired values in __m256i variable:


| Byte 0 | Byte 4 | Byte 8 |     ...     | Byte 120 | Byte 124 |

Is there a more efficient way to gather and rearrange the strided data other than this straightforward code?

union {  __m256i   reg;   uint8_t bytes[32]; } aux;
...
for( int i = 0; i < 32; i++ )
    aux.bytes[i] = data[i * 4];

Edit:

The step I'm trying to optimize is a bit column transposition; in other words, the bits of a certain column (32 possible bit columns in my data arrangement) should become a single uint32_t value, while the rest of the bits are ignored.

I perform the transposition by rearranging the data as shown, performing a left shift to bring the desired bit column to the most significant bit of each byte, and finally extracting and assembling the bits into a single uint32_t value via the _mm256_movemask_epi8() intrinsic.
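
For reference, a minimal untested sketch of that shift + movemask step, adapted to 128-bit vectors since AVX2 is unavailable (the helper name, its signature, and the assumption that the gathered bytes already sit in two __m128i are illustrative only):

#include <immintrin.h>
#include <stdint.h>

// `lo` and `hi` each hold 16 gathered stride-4 bytes; `bit` (0..7) selects
// the wanted bit column within each gathered byte.
static inline uint32_t bit_column(__m128i lo, __m128i hi, int bit)
{
    __m128i cnt = _mm_cvtsi32_si128(7 - bit);   // left-shift count for a runtime `bit`
    uint32_t m_lo = (uint32_t)_mm_movemask_epi8(_mm_sll_epi32(lo, cnt));
    uint32_t m_hi = (uint32_t)_mm_movemask_epi8(_mm_sll_epi32(hi, cnt));
    return m_lo | (m_hi << 16);                 // 32 bits, one per 4-byte record
}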

asked Aug 12 '15 by BlueStrat

3 Answers

One way would be: pack the bytes with _mm256_shuffle_epi8, blend the resulting vectors with _mm256_blend_epi32 (you'll need 4 such load+shuffle steps), and do a single 32-bit permute with _mm256_permutevar8x32_epi32.

Here is the pseudocode (I hope you can come up with the shuffle masks):

L1 = load32byte(buf)
L2 = load32byte(buf+32)
L3 = load32byte(buf+64)
L4 = load32byte(buf+96)

// Pack 4 bytes in the corresponding 32bit DWORD in each lane and zero-out other bytes
L1 = shuffle(L1, mask_for_L1)   
L2 = shuffle(L2, mask_for_L2)
L3 = shuffle(L3, mask_for_L3)
L4 = shuffle(L4, mask_for_L4)

// Vec = blend(blend(L1,L2),blend(L3,L4))
Vec = or(or(L1,L2), or(L3,L4))
Vec = permute(Vec)  // fix DWORD order in the vector

Update: I forgot to give the reason for "zero-out other bytes" - doing it this way lets you replace the blends with or.

Update: Reduced one cycle latency by rearranging or operations per Peter's comment below.

PS. I'd also recommend taking a look at the BMI instruction set, since you're doing bit manipulation.
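
For concreteness, here is one untested 128-bit take on the idea above, needing only SSSE3 rather than AVX2 (the shuffle masks and the helper name are an illustration, not from the answer); in the 128-bit case no final permute is needed:

#include <immintrin.h>
#include <stdint.h>

// Gathers bytes 0, 4, 8, ..., 60 of `buf` into one __m128i.
static inline __m128i gather_stride4_64(const uint8_t *buf)
{
    __m128i L1 = _mm_loadu_si128((const __m128i *)(buf +  0));
    __m128i L2 = _mm_loadu_si128((const __m128i *)(buf + 16));
    __m128i L3 = _mm_loadu_si128((const __m128i *)(buf + 32));
    __m128i L4 = _mm_loadu_si128((const __m128i *)(buf + 48));

    // Each mask moves source bytes 0,4,8,12 into a different 4-byte slot;
    // -1 (high bit set) zeroes the remaining bytes so the results can be ORed.
    const __m128i m1 = _mm_setr_epi8( 0, 4, 8,12, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1);
    const __m128i m2 = _mm_setr_epi8(-1,-1,-1,-1,  0, 4, 8,12, -1,-1,-1,-1, -1,-1,-1,-1);
    const __m128i m3 = _mm_setr_epi8(-1,-1,-1,-1, -1,-1,-1,-1,  0, 4, 8,12, -1,-1,-1,-1);
    const __m128i m4 = _mm_setr_epi8(-1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,  0, 4, 8,12);

    return _mm_or_si128(_mm_or_si128(_mm_shuffle_epi8(L1, m1), _mm_shuffle_epi8(L2, m2)),
                        _mm_or_si128(_mm_shuffle_epi8(L3, m3), _mm_shuffle_epi8(L4, m4)));
}

Calling it twice (for buf and buf + 64) covers the 128-byte block.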

answered Nov 11 '22 by Elalfer

I only just noticed the edit, which has a special-case answer.

If you need to do many different bit positions on the same data, then your current plan is good.

If you only need one bit position (esp. the highest bit position) from 128B of memory, you could use _mm256_movemask_ps to get the high bit from each 32b element. Then combine four 8bit masks in GP registers.

A good compiler should optimize that to:

vmovdqu   ymm0, [buf + 0]
; to select a different bit:
; vpslld  ymm0, ymm0, count   ; count can be imm8 or the low byte of an xmm register
vmovmskps eax, ymm0

vmovdqu   ymm0, [buf + 32]
vmovmskps ebx, ymm0

...  ecx and edx

mov       ah, bl
mov       ch, dl
shl       ecx, 16
or        eax, ecx

This is nice only if you're testing the high bit (so you don't need to shift each vector before vmovmsk). Even so, this is probably more instructions (and code size) than the other solution.
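
In intrinsics, the no-shift (high-bit-only) case could look roughly like the following untested sketch; the function name is an assumption, and the float casts are only there to reuse the FP load and movemask (no FP math is done, only sign bits are read):

#include <immintrin.h>
#include <stdint.h>

// High bit of every 32-bit record across 128 bytes, AVX1 only.
static inline uint32_t high_bit_column_128B(const uint8_t *buf)
{
    __m256 v0 = _mm256_loadu_ps((const float *)(buf +  0));
    __m256 v1 = _mm256_loadu_ps((const float *)(buf + 32));
    __m256 v2 = _mm256_loadu_ps((const float *)(buf + 64));
    __m256 v3 = _mm256_loadu_ps((const float *)(buf + 96));

    uint32_t m0 = (uint32_t)_mm256_movemask_ps(v0);   // 8 sign bits per vector
    uint32_t m1 = (uint32_t)_mm256_movemask_ps(v1);
    uint32_t m2 = (uint32_t)_mm256_movemask_ps(v2);
    uint32_t m3 = (uint32_t)_mm256_movemask_ps(v3);

    return m0 | (m1 << 8) | (m2 << 16) | (m3 << 24);
}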


Answer to the original question:

Similar to Elalfer's idea, but use the shuffle unit for pack instructions instead of pshufb. Also, all the ANDs are independent, so they can execute in parallel. Intel CPUs can do 3 ANDs at once, but only one shuffle. (Or two shuffles at once on pre-Haswell.)

// without AVX2: you won't really be able to
// do anything with a __m256i, only __m128i
// just convert everything to regular _mm_..., and leave out the final permute

mask = _mm256_set1_epi32(0x000000ff);

// same mask for all, and the load can fold into the AND
// You can write the load separately if you like, it'll still fold
L1 = and(mask, (buf))     // load and zero the bytes we don't want
L2 = and(mask, (buf+32))
L3 = and(mask, (buf+64))
L4 = and(mask, (buf+96))

// squish dwords from 2 concatenated regs down to words in 1 reg
pack12 = _mm256_packus_epi32(L1, L2);
pack34 = _mm256_packus_epi32(L3, L4);

packed = _mm256_packus_epi16(pack12, pack34);  // note the different width: zero-padded-16 -> 8

Vec = permute(packed)  // fix DWORD order in the vector (only needed for 256b version)

Vec = shift(Vec, bit_wanted)
bitvec = movemask(Vec)

    // shift:
    //  I guess word or dword granularity is fine, since byte granularity isn't available.
    //  You only care about the high bit, so it doesn't matter that you're not shifting zeroes into the bottom of each byte.

    // _mm_slli_epi32(Vec, imm8): 1 uop, 1c latency if your count is a compile-time constant.
    // _mm_sll_epi32 (Vec, _mm_cvtsi32_si128(count)): 2uop 2c latency if it's variable.

    // *not* _mm_sllv_epi32(): slower: different shift count for each element.

If you're doing this with just AVX (like you said) then you won't have 256b integer instructions available. Just build 128b vectors, and get 16b at a time of mask data. You won't need a final permute at the end.

Merge masks with integer instructions: (m2<<16) | m1. If desired, even go up to 64b of mask data, by combining two 32b masks.
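
Putting the 128-bit version together, an untested sketch (the helper name is an assumption) that follows the AND / pack / shift / movemask sequence above:

#include <immintrin.h>
#include <stdint.h>

// 16 bits of one bit column from 64 input bytes (bytes 0, 4, ..., 60).
// `bit` = 0..7 within each gathered byte. Needs SSE4.1 for _mm_packus_epi32.
static inline uint32_t bit_column_16(const uint8_t *buf, int bit)
{
    const __m128i mask = _mm_set1_epi32(0x000000ff);

    // Load and zero everything except byte 0 of each dword.
    __m128i L1 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i *)(buf +  0)));
    __m128i L2 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i *)(buf + 16)));
    __m128i L3 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i *)(buf + 32)));
    __m128i L4 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i *)(buf + 48)));

    // dwords -> words -> bytes; values are <= 0xff so the unsigned saturation never clips.
    __m128i packed = _mm_packus_epi16(_mm_packus_epi32(L1, L2), _mm_packus_epi32(L3, L4));

    // Bring the wanted bit to the top of each byte, then grab the 16 high bits.
    // (Use _mm_slli_epi32 instead if `bit` is a compile-time constant.)
    __m128i shifted = _mm_sll_epi32(packed, _mm_cvtsi32_si128(7 - bit));
    return (uint32_t)_mm_movemask_epi8(shifted);
}

Two calls, e.g. bit_column_16(buf, b) | (bit_column_16(buf + 64, b) << 16), give the full 32-bit column for a 128-byte block.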

Performance: This avoids the need for separate load instructions with AVX, since vpand can micro-fuse a memory operand if used with a one-register addressing mode.

  • cycle 1: 3 vpand instructions. (or only 2, if we were waiting on the address, since there's only 2 load ports.)
  • cycle 2: last one or two vpand, one pack (L1, L2)
  • cycle 3: next pack (L3, L4)
  • cycle 4: final pack
  • (256b AVX2 version: one extra cycle here for the permute)
  • cycle 5: packed shift with imm8 count: 1 uop, 1c latency.
  • cycle 6: movemask (3 cycle latency)

Latency = 8 (SnB and later)

Throughput: 3 shuffles (p5), 4 logicals (p015), 1 shift (p0), 1 pmovmsk (p0). 4 load uops.

  • SnB/IvB: 9 ALU uops -> 3c. 4 memory reads: 2c.
    So depending on what you're doing with the masks, 3 accumulators would be needed to keep the execution ports saturated. (ceil(8/3) = 3.).

With shift count in a variable that can't be resolved to a compile-time constant by compiler inlining / unrolling: latency = 9. And the shift produces another uop for p1/p5.

With AVX2 for Haswell and later, there are another 3 cycles of latency for the vpermd.

answered Nov 11 '22 by Peter Cordes

You can try unrolling that loop; this should at least get rid of one comparison (i < 32), one increment (i++) and one multiplication (i * 4) in the loop's body. Also, constant array offsets might work slightly faster than variable ones. But note that your compiler might generate similar (or better) code anyway with the appropriate compilation options enabled.

union {  __m256i   reg;   uint8_t bytes[32]; } aux;
...
aux.bytes[0] = data[0];
aux.bytes[1] = data[4];   // stride of 4: bytes[i] = data[i * 4]
...
aux.bytes[31] = data[124];
answered Nov 11 '22 by davlet