The Haswell architectures comes up with several new instructions. One of them is PEXT
(parallel bits extract) whose functionality is explained by this image (source here):
It takes a value r2
and a mask r3
and puts the extracted bits of r2
into r1
.
My question is the following: what would be the equivalent code of an optimized templated function in pure standard C++11, that would be likely to be optimized to this instruction by compilers in the future.
Here is some code from Matthew Fioravante's stdcxx-bitops GitHub repo that was floated to the std-proposals
mailinglist as a preliminary proposal to add a constexpr
bitwise operations library for C++.
#ifndef HAS_CXX14_CONSTEXPR
#define HAS_CXX14_CONSTEXPR 0
#endif
#if HAS_CXX14_CONSTEXPR
#define constexpr14 constexpr
#else
#define constexpr14
#endif
//Parallel Bits Extract
//x HGFEDCBA
//mask 01100100
//res 00000GFC
//x86_64 BMI2: PEXT
template <typename Integral>
constexpr14 Integral extract_bits(Integral x, Integral mask) {
Integral res = 0;
for(Integral bb = 1; mask != 0; bb += bb) {
if(x & mask & -mask) {
res |= bb;
}
mask &= (mask - 1);
}
return res;
}
As of August 2022, there still does not seem to be a way to write code for which a compiler will generate a PEXT
instruction.
However, at this point (C++20), you can call many functions in <bit>
that wrap assembly code.
In general, you can get a compiler to generate assembly instructions, where supported, for all functions in TBM
/BMI
and for BZHI
from BMI2
, which all can be described with simple expressions.
PDEP
is non-trivial and has the same complexity as PDEP
, with multiple possible implementations. So, neither seem straightforward for an optimizer to recognize.
The ABM
instructions, POPCNT
and LZCNT
are also non-trivial, and implementations could not be recognized. Fortunately, we have std::popcount
and std::countl_zero
, respectively, which can map directly to the corresponding instruction (where the hardware supports it).
It would seem that PEXT
and PDEP
will likely be similarly supported before the time compilers can infer that an instruction can replace the algorithm.
Now that C++ is officially two's compliment, it would be nice to see both arithmetic and logical shift right wrapped so that one could explicitly use them on signed or unsigned, as required.
As far as PEXT implementations go, here's a variation that might compare favorably to TemplateRex's (Matthew Fioravante's) version at https://stackoverflow.com/a/21159523/2963099
template <typename Integral>
constexpr Integral extract_bits(Integral x, Integral mask) {
Integral res=0;
int bb=1;
do {
Integral lsb=mask & -mask;
mask &= ~lsb;
bool isset=x & lsb;
res |= isset ? bb : 0;
bb+=bb;
} while (mask);
return res;
}
You can compare them both on Compiler Explorer at https://godbolt.org/z/3h9WrYqxT
Mostly this breaks out the least significant byte (lsb
) to remove it from mask
, and test against x
It is safe to run through once with mask === 0
. (lsb
will be 0, so isset
is false
). Using a do while
is much more efficient except for that trivial case.
Using the ternary operator is mostly stylist since, to me, it is a stronger hint of the intent to generate a cmove
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With