How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?


The intrinsic:

int mask = _mm256_movemask_epi8(__m256i s1) 

creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2 for example) I would like to perform the inverse of _mm256_movemask_epi8, i.e., create a __m256i vector with the most significant bit of each byte containing the corresponding bit of the uint32_t mask.
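To pin down the semantics I am after, here is a minimal scalar sketch of both directions. It is only a reference model, not a fast implementation: the names `movemask_scalar` and `inverse_movemask_scalar` are made up for illustration, and a `std::array` of 32 bytes stands in for the `__m256i` register.

```cpp
#include <array>
#include <cstdint>

// Scalar model of _mm256_movemask_epi8: bit i of the result is the
// most significant bit of byte i. (Hypothetical helper, for illustration.)
inline std::uint32_t movemask_scalar(const std::array<std::uint8_t, 32>& bytes) {
    std::uint32_t mask = 0;
    for (int i = 0; i < 32; ++i) {
        mask |= static_cast<std::uint32_t>(bytes[i] >> 7) << i;
    }
    return mask;
}

// Scalar model of the desired inverse: the MSB of byte i is bit i of the mask.
// The lower seven bits of each byte are left at zero here, but any value
// would do for the use case described in this question.
inline std::array<std::uint8_t, 32> inverse_movemask_scalar(std::uint32_t mask) {
    std::array<std::uint8_t, 32> bytes{};
    for (int i = 0; i < 32; ++i) {
        bytes[i] = ((mask >> i) & 1u) ? 0x80 : 0x00;
    }
    return bytes;
}
```

Under this model, `movemask_scalar(inverse_movemask_scalar(m))` returns `m` for every 32-bit `m`, which is the property a vectorized version has to preserve.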

What is the best way to do this?

Edit: I need to perform the inverse because the intrinsic _mm256_blendv_epi8 accepts only __m256i type mask instead of uint32_t. As such, in the resulting __m256i mask, I can ignore the bits other than the MSB of each byte.
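The reason the other bits can be ignored is that the byte blend only inspects the top bit of each mask byte. A rough scalar sketch of that selection rule, with the hypothetical name `blendv_scalar` and a `std::array` standing in for `__m256i`:

```cpp
#include <array>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;

// Scalar model of _mm256_blendv_epi8(a, b, mask): for each byte, take b
// when the MSB of the mask byte is set, otherwise take a.
// (Hypothetical helper, for illustration only.)
inline Bytes32 blendv_scalar(const Bytes32& a, const Bytes32& b, const Bytes32& mask) {
    Bytes32 out{};
    for (int i = 0; i < 32; ++i) {
        out[i] = (mask[i] & 0x80) ? b[i] : a[i];
    }
    return out;
}
```

A mask byte of 0x80 and a mask byte of 0xFF therefore select the same input, so an inverse that produces either 0x80/0x00 or 0xFF/0x00 per byte is equally usable.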

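I have implemented the three proposed approaches on a Haswell machine and timed 10^9 calls of each. Evgeny Kluev's approach (get_mask3) is the fastest at 1.07 s, followed by Jason R's (get_mask2) at 1.97 s and Paul R's (get_mask1) at 2.44 s. These are the icc timings; the g++ timings, listed in the comments above each function, are within 0.02 s of them. The code below was compiled with the -march=core-avx2 -O3 flags.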

#include <immintrin.h>
#include <cstdint>
#include <iostream>
#include <boost/date_time/posix_time/posix_time.hpp>

// Evgeny Kluev's approach
// t_icc = 1.07 s
// t_g++ = 1.09 s
__m256i get_mask3(const uint32_t mask) {
  __m256i vmask(_mm256_set1_epi32(mask));
  // Replicate mask byte k into result bytes 8k .. 8k+7.
  const __m256i shuffle(_mm256_setr_epi64x(0x0000000000000000,
      0x0101010101010101, 0x0202020202020202, 0x0303030303030303));
  vmask = _mm256_shuffle_epi8(vmask, shuffle);
  // Set every bit except the single bit that each byte has to test.
  const __m256i bit_mask(_mm256_set1_epi64x(0x7fbfdfeff7fbfdfe));
  vmask = _mm256_or_si256(vmask, bit_mask);
  // A byte compares equal to 0xff only if its tested bit was set.
  return _mm256_cmpeq_epi8(vmask, _mm256_set1_epi64x(-1));
}

// Jason R's approach
// t_icc = 1.97 s
// t_g++ = 1.97 s
__m256i get_mask2(const uint32_t mask) {
  __m256i vmask(_mm256_set1_epi32(mask));
  // Shift 32-bit element k left by k bits.
  const __m256i shift(_mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0));
  vmask = _mm256_sllv_epi32(vmask, shift);
  // Gather the bytes within each 128-bit lane ...
  const __m256i shuffle(_mm256_setr_epi64x(0x0105090d0004080c,
      0x03070b0f02060a0e, 0x0105090d0004080c, 0x03070b0f02060a0e));
  vmask = _mm256_shuffle_epi8(vmask, shuffle);
  // ... then interleave the 32-bit elements of the two lanes.
  const __m256i perm(_mm256_setr_epi64x(0x0000000000000004, 0x0000000100000005,
      0x0000000200000006, 0x0000000300000007));
  return _mm256_permutevar8x32_epi32(vmask, perm);
}

// Paul R's approach
// t_icc = 2.44 s
// t_g++ = 2.45 s
__m256i get_mask1(uint32_t mask) {
  const uint64_t pmask = 0x8080808080808080ULL; // bit unpacking mask for PDEP
  uint64_t amask0, amask1, amask2, amask3;

  // Each PDEP deposits the low 8 mask bits into the MSBs of 8 bytes.
  amask0 = _pdep_u64(mask, pmask);
  mask >>= 8;
  amask1 = _pdep_u64(mask, pmask);
  mask >>= 8;
  amask2 = _pdep_u64(mask, pmask);
  mask >>= 8;
  amask3 = _pdep_u64(mask, pmask);
  return _mm256_set_epi64x(amask3, amask2, amask1, amask0);
}

int main() {
  // Accumulate into a zero-initialized vector so the calls are not optimized away.
  __m256i mask = _mm256_setzero_si256();
  boost::posix_time::ptime start(
      boost::posix_time::microsec_clock::universal_time());

  for (unsigned i(0); i != 1000000000; ++i) {
    mask = _mm256_xor_si256(mask, get_mask3(i));
  }

  boost::posix_time::ptime end(
      boost::posix_time::microsec_clock::universal_time());
  std::cout << "duration:" << (end - start)
            << " mask:" << _mm256_movemask_epi8(mask) << std::endl;
  return 0;
}
