Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Standard C++11 code equivalent to the PEXT Haswell instruction (and likely to be optimized by compiler)

The Haswell architectures comes up with several new instructions. One of them is PEXT (parallel bits extract) whose functionality is explained by this image (source here):

pext

It takes a value r2 and a mask r3 and puts the extracted bits of r2 into r1.

My question is the following: what would be the equivalent code of an optimized templated function in pure standard C++11, that would be likely to be optimized to this instruction by compilers in the future.

like image 766
Vincent Avatar asked Jan 15 '14 17:01

Vincent


2 Answers

Here is some code from Matthew Fioravante's stdcxx-bitops GitHub repo that was floated to the std-proposals mailinglist as a preliminary proposal to add a constexpr bitwise operations library for C++.

#ifndef HAS_CXX14_CONSTEXPR
#define HAS_CXX14_CONSTEXPR 0
#endif

#if HAS_CXX14_CONSTEXPR
#define constexpr14 constexpr
#else
#define constexpr14
#endif

//Parallel Bits Extract
//x    HGFEDCBA
//mask 01100100
//res  00000GFC
//x86_64 BMI2: PEXT
template <typename Integral>
constexpr14 Integral extract_bits(Integral x, Integral mask) {
  Integral res = 0;
  for(Integral bb = 1; mask != 0; bb += bb) {
    if(x & mask & -mask) {
      res |= bb;
    }
    mask &= (mask - 1);
  }
  return res;
}
like image 141
TemplateRex Avatar answered Oct 11 '22 01:10

TemplateRex


As of August 2022, there still does not seem to be a way to write code for which a compiler will generate a PEXT instruction.

However, at this point (C++20), you can call many functions in <bit> that wrap assembly code.

In general, you can get a compiler to generate assembly instructions, where supported, for all functions in TBM/BMI and for BZHI from BMI2, which all can be described with simple expressions.

PDEP is non-trivial and has the same complexity as PDEP, with multiple possible implementations. So, neither seem straightforward for an optimizer to recognize.

The ABM instructions, POPCNT and LZCNTare also non-trivial, and implementations could not be recognized. Fortunately, we have std::popcount and std::countl_zero, respectively, which can map directly to the corresponding instruction (where the hardware supports it).

It would seem that PEXT and PDEP will likely be similarly supported before the time compilers can infer that an instruction can replace the algorithm.

Now that C++ is officially two's compliment, it would be nice to see both arithmetic and logical shift right wrapped so that one could explicitly use them on signed or unsigned, as required.

As far as PEXT implementations go, here's a variation that might compare favorably to TemplateRex's (Matthew Fioravante's) version at https://stackoverflow.com/a/21159523/2963099

template <typename Integral>
constexpr Integral extract_bits(Integral x, Integral mask) {
   Integral res=0;
   int bb=1;

   do {
        Integral lsb=mask & -mask;
        mask &= ~lsb;
        bool isset=x & lsb;
        res |= isset ? bb : 0;
        bb+=bb;
    } while (mask);
        
  return res;
}

You can compare them both on Compiler Explorer at https://godbolt.org/z/3h9WrYqxT

Mostly this breaks out the least significant byte (lsb) to remove it from mask, and test against x

It is safe to run through once with mask === 0. (lsb will be 0, so isset is false). Using a do while is much more efficient except for that trivial case.

Using the ternary operator is mostly stylist since, to me, it is a stronger hint of the intent to generate a cmove

like image 44
Glenn Teitelbaum Avatar answered Oct 11 '22 01:10

Glenn Teitelbaum