Problem: I have 4 x 256-bit AVX2 vectors (A, B, C, D) and I need to swap their 128-bit halves between pairs of vectors. Here is the transformation I need to do.
         Original                          Transformed
     || Low Lane || High Lane ||       || Low Lane || High Lane ||
A =  ||    L1    ||    H1     ||  =>   ||    L1    ||    L2     ||
B =  ||    L2    ||    H2     ||  =>   ||    H1    ||    H2     ||
C =  ||    L3    ||    H3     ||  =>   ||    L3    ||    L4     ||
D =  ||    L4    ||    H4     ||  =>   ||    H3    ||    H4     ||
Basically I need to store output in the following order L1, L2, L3, L4, H1, H2, H3, H4 to an array.
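For reference, here is a plain-C sketch of the target memory layout (the function name is hypothetical and it assumes 32-bit elements, with elements 0..3 of each vector being the low lane and 4..7 the high lane):

#include <stdint.h>
#include <string.h>

/* v[0]..v[3] correspond to A..D; out receives L1 L2 L3 L4 H1 H2 H3 H4. */
static void reference_store(const uint32_t v[4][8], uint32_t out[32])
{
    for (int i = 0; i < 4; i++)                                      /* L1 L2 L3 L4 */
        memcpy(out + 4 * i, v[i], 4 * sizeof(uint32_t));
    for (int i = 0; i < 4; i++)                                      /* H1 H2 H3 H4 */
        memcpy(out + 16 + 4 * i, v[i] + 4, 4 * sizeof(uint32_t));
}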
My current solution is using:
4x _mm256_blend_epi32 (worst-case: latency 1, throughput 0.35)
4x _mm256_permute2x128_si256 (worst-case: latency 3, throughput 1)
// (a, c) = block0, (b, d) = block1
a = Avx2.Permute2x128(a, a, 1);
var template = Avx2.Blend(a, b, 0b1111_0000); // H1 H2
a = Avx2.Blend(a, b, 0b0000_1111); // L2 L1
a = Avx2.Permute2x128(a, a, 1); // L1 L2
b = template;
c = Avx2.Permute2x128(c, c, 1);
template = Avx2.Blend(c, d, 0b1111_0000); // H3 H4
c = Avx2.Blend(c, d, 0b0000_1111); // L4 L3
c = Avx2.Permute2x128(c, c, 1); // L3 L4
d = template;
// Store keystream into buffer (in corrected order = [block0, block1])
Avx2.Store(outputPtr, a);
Avx2.Store(outputPtr + Vector256<uint>.Count, c);
Avx2.Store(outputPtr + Vector256<uint>.Count * 2, b);
Avx2.Store(outputPtr + Vector256<uint>.Count * 3, d);
Note: I'm using C#/.NET Core for AVX2, in case you are wondering. Feel free to use examples in C/C++.
Is there any better or more efficient way to do it?
Edit: the accepted answer, translated to C#:
var tmp = Avx2.Permute2x128(a, b, 0x20);
b = Avx2.Permute2x128(a, b, 0x31);
a = tmp;
tmp = Avx2.Permute2x128(c, d, 0x20);
d = Avx2.Permute2x128(c, d, 0x31);
c = tmp;
If I understand you correctly, I think you could get away without the blend instructions for this 2x4 transpose, creating new variables that pick the lanes you want. Something like:
__m256i a; // L1 H1
__m256i b; // L2 H2
__m256i c; // L3 H3
__m256i d; // L4 H4
__m256i A = _mm256_permute2x128_si256(a, b, 0x20); // L1 L2
__m256i B = _mm256_permute2x128_si256(a, b, 0x31); // H1 H2
__m256i C = _mm256_permute2x128_si256(c, d, 0x20); // L3 L4
__m256i D = _mm256_permute2x128_si256(c, d, 0x31); // H3 H4
You still have the 3-cycle latency of the vperm2i128 instruction, but you always have that when data crosses 128-bit lanes. These 4 shuffles are independent, so they can pipeline (ILP); Intel and Zen 2 have 1/clock throughput for vperm2i128 (https://agner.org/optimize/, https://uops.info/).
If you're lucky, a compiler will optimize the L1,L2 and L3,L4 shuffles into vinserti128, which AMD Zen 1 runs much more efficiently (1 uop instead of 8: Zen 1 splits lane-crossing shuffles into multiple 128-bit uops).
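If you want to nudge the compiler toward that form, you can spell out the two low-lane combinations with an explicit insert. This is a sketch of the same transpose using only intrinsics I'd expect to compile to vinserti128 / vperm2i128, but what the compiler actually emits still depends on the backend:

__m256i A = _mm256_inserti128_si256(a, _mm256_castsi256_si128(b), 1); // L1 L2 (insert -> vinserti128)
__m256i C = _mm256_inserti128_si256(c, _mm256_castsi256_si128(d), 1); // L3 L4 (insert -> vinserti128)
__m256i B = _mm256_permute2x128_si256(a, b, 0x31);                    // H1 H2 (still vperm2i128)
__m256i D = _mm256_permute2x128_si256(c, d, 0x31);                    // H3 H4 (still vperm2i128)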
These 4 shuffles take 4 uops for the shuffle port (port 5 on Intel); Intel and Zen 2 have only 1/clock throughput for these shuffles. If that would be a bottleneck in your loop, consider @chtz's answer below: it costs more front-end throughput, but uses only 2 shuffles to line up the 4 lanes that need to move, followed by cheap blends (vpblendd). Related: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
You can do your operation with two permutes and four blends, for a total throughput of 2 cycles:
void foo(
__m256i a, // L1 H1
__m256i b, // L2 H2
__m256i c, // L3 H3
__m256i d, // L4 H4
__m256i* outputPtr
)
{
// permute. Port usage: 1*p5, Latency 3 on both inputs
__m256i BA = _mm256_permute2x128_si256(a, b, 0x21); // H1 L2
__m256i DC = _mm256_permute2x128_si256(c, d, 0x21); // H3 L4
// blend. Port usage: 1*p015, Latency 1 on both inputs
__m256i A = _mm256_blend_epi32(a, BA, 0xf0); // L1 L2
__m256i B = _mm256_blend_epi32(BA, b, 0xf0); // H1 H2
__m256i C = _mm256_blend_epi32(c, DC, 0xf0); // L3 L4
__m256i D = _mm256_blend_epi32(DC, d, 0xf0); // H3 H4
_mm256_store_si256(outputPtr+0, A);
_mm256_store_si256(outputPtr+1, B);
_mm256_store_si256(outputPtr+2, C);
_mm256_store_si256(outputPtr+3, D);
}
However, depending on context (especially if a, ..., d are originally read from memory), it may also be better to use a sequence of vmovdqu and vinserti128 instructions with m128 memory operands. You'll have twice as many loads, but no inter-lane latency and no bottleneck on port 5 -- regarding latency and port usage, a memory-source vinserti128 behaves like a blend.
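Here is a hedged sketch of that memory-source variant. It assumes the four input vectors sit consecutively in memory at inPtr in their original order, i.e. eight 128-bit chunks laid out as L1 H1 L2 H2 L3 H3 L4 H4, and that outputPtr is 32-byte aligned; compilers generally fold the second _mm_loadu_si128 of each pair into the vinserti128 memory operand, but that is not guaranteed:

#include <immintrin.h>

void foo_from_memory(const __m128i* inPtr, __m256i* outputPtr)
{
    // Each output vector is built from two 128-bit loads:
    // low lane via a plain load, high lane via vinserti128 with a memory source.
    __m256i A = _mm256_inserti128_si256(
        _mm256_castsi128_si256(_mm_loadu_si128(inPtr + 0)),   // L1
        _mm_loadu_si128(inPtr + 2), 1);                       // L2
    __m256i B = _mm256_inserti128_si256(
        _mm256_castsi128_si256(_mm_loadu_si128(inPtr + 1)),   // H1
        _mm_loadu_si128(inPtr + 3), 1);                       // H2
    __m256i C = _mm256_inserti128_si256(
        _mm256_castsi128_si256(_mm_loadu_si128(inPtr + 4)),   // L3
        _mm_loadu_si128(inPtr + 6), 1);                       // L4
    __m256i D = _mm256_inserti128_si256(
        _mm256_castsi128_si256(_mm_loadu_si128(inPtr + 5)),   // H3
        _mm_loadu_si128(inPtr + 7), 1);                       // H4

    _mm256_store_si256(outputPtr + 0, A);
    _mm256_store_si256(outputPtr + 1, B);
    _mm256_store_si256(outputPtr + 2, C);
    _mm256_store_si256(outputPtr + 3, D);
}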