I'm trying to optimize an algorithm that will process massive datasets and that could strongly benefit from AVX SIMD instructions. Unfortunately, the input memory layout is not optimal for the required computations: the data must be reordered by assembling __m256i values from individual bytes that are exactly 4 bytes apart:
BEGIN EDIT
My target CPUs do not support AVX2 instructions, so, as @Elalfer and @PeterCordes pointed out, I can't make use of __m256i values; the code must be converted to use __m128i values instead.
END EDIT
DataSet layout in memory
Byte 0 | Byte 1 | Byte 2 | Byte 3
Byte 4 | Byte 5 | Byte 6 | Byte 7
...
Byte 120 | Byte 121 | Byte 122 | Byte 123
Byte 124 | Byte 125 | Byte 126 | Byte 127
Desired values in a __m256i variable:
| Byte 0 | Byte 4 | Byte 8 | ... | Byte 120 | Byte 124 |
Is there a more efficient way to gather and rearrange the strided data other than this straightforward code?
union { __m256i reg; uint8_t bytes[32]; } aux;
...
for( int i = 0; i < 32; i++ )
aux.bytes[i] = data[i * 4];
Edit:
The step I'm trying to optimize is a bit-column transposition; in other words, the bits of a given column (there are 32 possible bit columns in my data arrangement) should become a single uint32_t value, while the rest of the bits are ignored.
I perform the transposition by rearranging the data as shown, performing a left shift to bring the desired bit column into the most significant bit of each byte, and finally extracting and assembling the bits into a single uint32_t value via the _mm256_movemask_epi8() intrinsic.
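For reference, here is roughly what that looks like when restricted to __m128i (per the edit above); the function name and the bit-index convention (0 = LSB .. 7 = MSB) are placeholders, and the gather loop is still the straightforward version shown earlier:

#include <immintrin.h>
#include <stdint.h>

// Current approach adapted to the no-AVX2 constraint: gather 16 bytes
// (stride 4) into a __m128i, shift the wanted bit column up to the MSB
// of each byte, and collect one bit per byte with _mm_movemask_epi8().
static uint32_t bit_column_gather(const uint8_t *data, int bit)
{
    uint32_t out = 0;
    for (int half = 0; half < 2; half++) {           // 64 input bytes per half
        union { __m128i reg; uint8_t bytes[16]; } aux;
        for (int i = 0; i < 16; i++)
            aux.bytes[i] = data[half * 64 + i * 4];  // every 4th byte

        // Shift within dwords; only the MSB of each byte matters afterwards.
        __m128i v = _mm_sll_epi32(aux.reg, _mm_cvtsi32_si128(7 - bit));
        out |= (uint32_t)_mm_movemask_epi8(v) << (16 * half);
    }
    return out;
}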
One way would be to pack the bytes with _mm256_shuffle_epi8, blend all the resulting vectors with _mm256_blend_epi32 (you'll need to do 4 such load+shuffle steps), and do a single 32-bit permute with _mm256_permutevar8x32_epi32.
Here is pseudo code (I hope you can come up with the shuffle masks; a concrete sketch with example masks follows after the updates below):
L1 = load32byte(buf)
L2 = load32byte(buf+32)
L3 = load32byte(buf+64)
L4 = load32byte(buf+96)
// Pack 4 bytes in the corresponding 32bit DWORD in each lane and zero-out other bytes
L1 = shuffle(L1, mask_for_L1)
L2 = shuffle(L2, mask_for_L2)
L3 = shuffle(L3, mask_for_L3)
L4 = shuffle(L4, mask_for_L4)
// Vec = blend(blend(L1,L2),blend(L3,L4))
Vec = or(or(or(L1,L2),L3),L4)
Vec = permute(Vec) // fix DWORD order in the vector
Update: I forgot to give the reason I said "zero-out other bytes" - this way you can replace the blend with an or.
Update: Reduced latency by one cycle by rearranging the or operations, per Peter's comment below.
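For concreteness, here is a sketch of the above in AVX2 intrinsics with one possible set of shuffle masks and the permute indices worked out (the masks, the indices, and the function name are my own guesses at what is intended, so verify them against your layout). A shuffle-control byte with its high bit set (-1 here) zeroes the destination byte, which is what lets the blend be replaced by a plain or:

#include <immintrin.h>
#include <stdint.h>

static __m256i gather_stride4_avx2(const uint8_t *buf)
{
    // Each 128b lane keeps its bytes 0,4,8,12, packed into one dword;
    // L1..L4 each target a different dword slot so the results don't overlap.
    const __m256i m1 = _mm256_setr_epi8(
         0, 4, 8,12, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
         0, 4, 8,12, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1);
    const __m256i m2 = _mm256_setr_epi8(
        -1,-1,-1,-1,  0, 4, 8,12, -1,-1,-1,-1, -1,-1,-1,-1,
        -1,-1,-1,-1,  0, 4, 8,12, -1,-1,-1,-1, -1,-1,-1,-1);
    const __m256i m3 = _mm256_setr_epi8(
        -1,-1,-1,-1, -1,-1,-1,-1,  0, 4, 8,12, -1,-1,-1,-1,
        -1,-1,-1,-1, -1,-1,-1,-1,  0, 4, 8,12, -1,-1,-1,-1);
    const __m256i m4 = _mm256_setr_epi8(
        -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,  0, 4, 8,12,
        -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,  0, 4, 8,12);

    __m256i L1 = _mm256_shuffle_epi8(_mm256_loadu_si256((const __m256i*)(buf +  0)), m1);
    __m256i L2 = _mm256_shuffle_epi8(_mm256_loadu_si256((const __m256i*)(buf + 32)), m2);
    __m256i L3 = _mm256_shuffle_epi8(_mm256_loadu_si256((const __m256i*)(buf + 64)), m3);
    __m256i L4 = _mm256_shuffle_epi8(_mm256_loadu_si256((const __m256i*)(buf + 96)), m4);

    __m256i vec = _mm256_or_si256(_mm256_or_si256(L1, L2), _mm256_or_si256(L3, L4));

    // The in-lane shuffles leave the dwords interleaved between the two
    // lanes; one cross-lane dword permute restores memory order.
    return _mm256_permutevar8x32_epi32(vec, _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));
}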
PS. I'd also recommend taking a look at the BMI instruction set, since you're doing bit manipulation.
I only just noticed the edit, which has a special-case answer.
If you need to do many different bit positions on the same data, then your current plan is good.
If you only need one bit position (especially the highest bit position) from 128B of memory, you can use _mm256_movemask_ps to get the high bit from each 32b element. Then combine the four 8-bit masks in GP registers.
A good compiler should optimize that to:
vmovdqu ymm0, [buf + 0]
; to select a different bit:
; vpslld ymm0, ymm0, count ; count can be imm8 or the low byte of an xmm register
vmovmskps eax, ymm0
vmovdqu ymm0, [buf + 32]
vmovmskps ebx, ymm0
... ecx and edx
mov ah, bl
mov ch, dl
shl ecx, 16
or eax, ecx
This is nice only if you're testing the high bit (so you don't need to shift each vector before vmovmsk). Even so, this is probably more instructions (and code size) than the other solution.
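Roughly the same thing in intrinsics (AVX1 only, high-bit case; the function name is mine):

#include <immintrin.h>
#include <stdint.h>

// Each vmovmskps grabs bit 31 of every 32-bit element; to pick a different
// bit you would shift each vector first, as in the commented-out vpslld above.
static uint32_t high_bit_column(const uint8_t *buf)
{
    __m256 v0 = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i*)(buf +  0)));
    __m256 v1 = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i*)(buf + 32)));
    __m256 v2 = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i*)(buf + 64)));
    __m256 v3 = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i*)(buf + 96)));
    uint32_t m0 = (uint32_t)_mm256_movemask_ps(v0);   // 8 bits per 32B load
    uint32_t m1 = (uint32_t)_mm256_movemask_ps(v1);
    uint32_t m2 = (uint32_t)_mm256_movemask_ps(v2);
    uint32_t m3 = (uint32_t)_mm256_movemask_ps(v3);
    return m0 | (m1 << 8) | (m2 << 16) | (m3 << 24);  // combine in GP registers
}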
Answer to the original question:
Similar to Elalfer's idea, but use the shuffle unit for pack instructions instead of pshufb. Also, all the ANDs are independent, so they can execute in parallel. Intel CPUs can do 3 ANDs at once, but only one shuffle (or two shuffles at once on pre-Haswell).
// without AVX2: you won't really be able to
// do anything with a __m256i, only __m128i
// just convert everything to regular _mm_..., and leave out the final permute
mask = _mm256_set1_epi32(0x000000ff);
// same mask for all, and the load can fold into the AND
// You can write the load separately if you like, it'll still fold
L1 = and(mask, (buf)) // load and zero the bytes we don't want
L2 = and(mask, (buf+32))
L3 = and(mask, (buf+64))
L4 = and(mask, (buf+96))
// squish dwords from 2 concatenated regs down to words in 1 reg
pack12 = _mm256_packus_epi32(L1, L2);
pack34 = _mm256_packus_epi32(L3, L4);
packed = _mm256_packus_epi16(pack12, pack34); // note the different width: zero-padded-16 -> 8
Vec = permute(packed) // fix DWORD order in the vector (only needed for 256b version)
Vec = shift(Vec, bit_wanted)
bitvec = movemask(Vec)
// shift:
// I guess word or dword granularity is fine, since byte granularity isn't available.
// You only care about the high bit, so it doesn't matter that you're not shifting zeroes into the bottom of each byte.
// _mm_slli_epi32(Vec, imm8): 1 uop, 1c latency if your count is a compile-time constant.
// _mm_sll_epi32 (Vec, _mm_cvtsi32_si128(count)): 2uop 2c latency if it's variable.
// *not* _mm_sllv_epi32(): slower: different shift count for each element.
If you're doing this with just AVX (like you said), then you won't have 256b integer instructions available. Just build 128b vectors, and get 16b at a time of mask data. You won't need a final permute at the end. Merge the masks with integer instructions: (m2 << 16) | m1. If desired, go up to 64b of mask data by combining two 32b masks.
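Putting the 128b version together, an untested sketch might look like this (the function name and the bit_wanted convention, 0 = LSB .. 7 = MSB, are mine):

#include <immintrin.h>   // needs SSE4.1 for _mm_packus_epi32
#include <stdint.h>

// Each 64-byte half of the input yields 16 mask bits; the two halves are
// merged with integer ops as (m2 << 16) | m1.
static uint32_t bit_column_128(const uint8_t *buf, int bit_wanted)
{
    const __m128i mask  = _mm_set1_epi32(0x000000ff);
    const __m128i count = _mm_cvtsi32_si128(7 - bit_wanted);  // variable shift count
    uint32_t result = 0;

    for (int half = 0; half < 2; half++) {
        const uint8_t *p = buf + 64 * half;
        // Load and zero the bytes we don't want; the compiler should fold
        // each load into the AND's memory operand.
        __m128i L1 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i*)(p +  0)));
        __m128i L2 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i*)(p + 16)));
        __m128i L3 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i*)(p + 32)));
        __m128i L4 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i*)(p + 48)));

        // dwords -> words, then words -> bytes; values are <= 255, so the
        // unsigned saturation never kicks in.
        __m128i pack12 = _mm_packus_epi32(L1, L2);        // SSE4.1
        __m128i pack34 = _mm_packus_epi32(L3, L4);
        __m128i packed = _mm_packus_epi16(pack12, pack34);

        // Bring the wanted bit to the top of each byte; garbage shifted in
        // from neighbouring bytes lands only in the low bits, which
        // movemask ignores.
        __m128i shifted = _mm_sll_epi32(packed, count);
        result |= (uint32_t)_mm_movemask_epi8(shifted) << (16 * half);
    }
    return result;
}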
Performance: This avoids the need for separate load instructions with AVX, since vpand can micro-fuse a memory operand if used with a one-register addressing mode. The critical path looks roughly like:
- three vpand instructions (or only 2, if we were waiting on the address, since there's only 2 load ports)
- the remaining vpand, plus one pack (L1, L2)
- the next pack (L3, L4)
- the final pack
Latency = 8 (SnB and later)
Throughput: 3 shuffles (p5), 4 logicals (p015), 1 shift (p0), 1 pmovmsk (p0). 4 load uops.
With shift count in a variable that can't be resolved to a compile-time constant by compiler inlining / unrolling: latency = 9. And the shift produces another uop for p1/p5.
With AVX2 for Haswell and later, there are 3 cycles of extra latency for the vpermd.
You can try unrolling that loop; this should at least get rid of one comparison (i < 32), one increment (i++), and one multiplication (i * 4) in the loop body. Constant array offsets may also be slightly faster than variable ones. But note that your compiler might generate similar (or better) code anyway with the appropriate compilation options enabled.
union { __m256i reg; uint8_t bytes[32]; } aux;
...
aux.bytes[0] = data[0];
aux.bytes[1] = data[4];
...
aux.bytes[31] = data[124];