I am working on a data structure where I have an array of 16 uint64_t values. They are laid out like this in memory (each entry below representing a single uint64_t):
A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3
The desired result is to transpose the array into this:
A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3
Rotating the array by 90 degrees is also an acceptable solution for the later loop:
D0 C0 B0 A0 D1 C1 B1 A1 D2 C2 B2 A2 D3 C3 B3 A3
I need this in order to operate on the array quickly at a later point (traverse it sequentially with another SIMD pass, 4 elements at a time).
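For clarity, this is the permutation in plain scalar code (a minimal reference sketch; the function name is only illustrative):
#include <stdint.h>
// Transpose a 4x4 matrix of uint64_t stored row-major: out[4*col + row] = in[4*row + col].
static void transpose_4x4_scalar(const uint64_t in[16], uint64_t out[16])
{
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[4 * c + r] = in[4 * r + c];
}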
So far, I have tried to "blend" the data by loading a 4 x 64-bit vector of A's, bitmasking and shuffling the elements, OR'ing it with the B's, and then repeating that for the C's... Unfortunately, this is 5 SIMD instructions per segment of 4 elements (one load, one mask, one shuffle, one OR with the next element, and finally a store), so 5 x 4 instructions for the whole array. It seems I should be able to do better.
I have AVX2 available and I am compiling with clang.
uint64_t A[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
__m256i row0 = _mm256_loadu_si256((__m256i*)&A[ 0]); //0 1 2 3
__m256i row1 = _mm256_loadu_si256((__m256i*)&A[ 4]); //4 5 6 7
__m256i row2 = _mm256_loadu_si256((__m256i*)&A[ 8]); //8 9 a b
__m256i row3 = _mm256_loadu_si256((__m256i*)&A[12]); //c d e f
I don't have hardware to test this on right now, but something like the following should do what you want:
__m256i tmp3, tmp2, tmp1, tmp0;
tmp0 = _mm256_unpacklo_epi64(row0, row1); //0 4 2 6
tmp1 = _mm256_unpackhi_epi64(row0, row1); //1 5 3 7
tmp2 = _mm256_unpacklo_epi64(row2, row3); //8 c a e
tmp3 = _mm256_unpackhi_epi64(row2, row3); //9 d b f
//now select the appropriate 128-bit lanes
row0 = _mm256_permute2x128_si256(tmp0, tmp2, 0x20); //0 4 8 c
row1 = _mm256_permute2x128_si256(tmp1, tmp3, 0x20); //1 5 9 d
row2 = _mm256_permute2x128_si256(tmp0, tmp2, 0x31); //2 6 a e
row3 = _mm256_permute2x128_si256(tmp1, tmp3, 0x31); //3 7 b f
The __m256i _mm256_permute2x128_si256(__m256i a, __m256i b, const int imm) intrinsic selects 128-bit lanes from the two source registers; you can read about it in the Intel Intrinsics Guide. There is a variant, _mm256_permute2f128_si256, which only needs AVX and operates in the floating-point domain. I used it to check that I had the correct control words.
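Putting the pieces together, here is a minimal self-contained sketch; the function name, the stores at the end, and the build flags mentioned below are my own additions rather than part of the answer above:
#include <immintrin.h>
#include <stdint.h>

// Transpose a row-major 4x4 matrix of uint64_t with AVX2 (unpack within lanes, then permute lanes).
static void transpose_4x4_u64(const uint64_t in[16], uint64_t out[16])
{
    __m256i row0 = _mm256_loadu_si256((const __m256i*)&in[ 0]); //0 1 2 3
    __m256i row1 = _mm256_loadu_si256((const __m256i*)&in[ 4]); //4 5 6 7
    __m256i row2 = _mm256_loadu_si256((const __m256i*)&in[ 8]); //8 9 a b
    __m256i row3 = _mm256_loadu_si256((const __m256i*)&in[12]); //c d e f

    __m256i tmp0 = _mm256_unpacklo_epi64(row0, row1); //0 4 2 6
    __m256i tmp1 = _mm256_unpackhi_epi64(row0, row1); //1 5 3 7
    __m256i tmp2 = _mm256_unpacklo_epi64(row2, row3); //8 c a e
    __m256i tmp3 = _mm256_unpackhi_epi64(row2, row3); //9 d b f

    _mm256_storeu_si256((__m256i*)&out[ 0], _mm256_permute2x128_si256(tmp0, tmp2, 0x20)); //0 4 8 c
    _mm256_storeu_si256((__m256i*)&out[ 4], _mm256_permute2x128_si256(tmp1, tmp3, 0x20)); //1 5 9 d
    _mm256_storeu_si256((__m256i*)&out[ 8], _mm256_permute2x128_si256(tmp0, tmp2, 0x31)); //2 6 a e
    _mm256_storeu_si256((__m256i*)&out[12], _mm256_permute2x128_si256(tmp1, tmp3, 0x31)); //3 7 b f
}
Built with clang -O2 -mavx2, this amounts to 4 loads, 8 shuffle-type instructions and 4 stores for the whole 4x4 block.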
An alternative is to use the gather instructions, with which you can load the transposed matrix directly. The code below works with gcc on an i7 Haswell; the gathers use a scale of 1, so the indices are byte offsets.
int32_t stride = 4 * sizeof(A[0]); // one row of 4 uint64_t = 32 bytes
__m128i r128_gather_idx = _mm_set_epi32(3 * stride, 2 * stride, 1 * stride, 0 * stride); // byte offsets of the four rows
__m256i row0 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 0]), r128_gather_idx, 1); //0 4 8 c
__m256i row1 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 1]), r128_gather_idx, 1); //1 5 9 d
__m256i row2 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 2]), r128_gather_idx, 1); //2 6 a e
__m256i row3 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 3]), r128_gather_idx, 1); //3 7 b f
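To write the transposed data back out, the gathered rows can be followed by regular stores; a small usage sketch assuming the rows above (the destination array T is my own name):
uint64_t T[16];
_mm256_storeu_si256((__m256i*)&T[ 0], row0); //0 4 8 c
_mm256_storeu_si256((__m256i*)&T[ 4], row1); //1 5 9 d
_mm256_storeu_si256((__m256i*)&T[ 8], row2); //2 6 a e
_mm256_storeu_si256((__m256i*)&T[12], row3); //3 7 b f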