I wanted to try out some intrinsics for the x86 BMI set. Running grep bmi /proc/cpuinfo shows both bmi1 and bmi2 on my AMD Ryzen CPU. But I cannot get clang to generate some of the instructions, in particular BLSI and BLSR. It looks like they are not supported in clang's bmiintrin.h. Is that indeed so, or am I missing something? In general, do you need to install some kind of "plugin" for LLVM from Intel/AMD to use CPU-specific features? Is it better to use their build tools in this case?
Following this article, I built a test program with BLSI or BLSR instructions:
// test_bmi.c
#include <x86intrin.h>
// not #include <bmiintrin.h> - clang errors and asks for x86intrin.h
volatile unsigned long long result;
int main(void) {
    ...
    for (unsigned long long i = 0; i < max_count; i++) {
        result = _blsi_u64(i);
    }
}
It's built with -march=native to turn on all of the CPU features:
clang -march=native test_bmi.c -o test_bmi
But there are no blsi-like instructions in the objdump -d test_bmi disassembly. Looking at the bmiintrin.h source, it seems the BLSI and BLSR instructions are not actually supported:
static __inline__ unsigned long long __DEFAULT_FN_ATTRS
__blsi_u64(unsigned long long __X)
{
return __X & -__X;
}
But, for example, BEXTR is in the header and it does show up in the objdump disassembly:
static __inline__ unsigned long long __DEFAULT_FN_ATTRS
__bextr_u64(unsigned long long __X, unsigned long long __Y)
{
return __builtin_ia32_bextr_u64(__X, __Y);
}
$ objdump -d test_bmi | grep bextr
12c5: c4 e2 f0 f7 c0 bextr %rcx,%rax,%rax
Does this mean that clang does not really support the BLSI and BLSR instructions? Is that on purpose, or did I miss something needed to enable them?
No special intrinsics are needed, as clang knows to use these instructions and others like them (including andn, bextr, popcnt (!), blsi, blsmsk, blsr, and tzcnt (!)) if you just write out their behaviour in C.
For example, you can write
int my_blsi(int x)
{
return (x & -x);
}
and find that the compiler turns this into something like
my_blsi:
blsil %edi, %eax
ret
This peephole analysis is pretty powerful and can even recognise common implementations of popcnt (both with a loop and with bit manipulation), tzcnt, and similar instructions.
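As a rough illustration of the kind of loops meant here, the sketch below shows a loop-based population count and a loop-based trailing-zero count. The function names are hypothetical, and whether a given compiler version actually collapses either loop into a single popcnt or tzcnt is an assumption you would need to check in the generated assembly.

```c
#include <stdint.h>

/* Hypothetical helper names; only the idioms matter. */

/* Population count: clear the lowest set bit until nothing is left,
 * counting how many bits were cleared. */
unsigned popcount_loop(uint64_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;   /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Trailing zero count: shift right until bit 0 is set.
 * Returns 64 for x == 0, matching what tzcnt reports for a zero input. */
unsigned tzcount_loop(uint64_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 64;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}
```

Compiling with something like clang -O2 -march=native -S and reading the resulting .s file is one way to see which of these the compiler recognises.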
So just write code like you normally would and compile for a target architecture that supports the instructions you are looking for. The compiler will use them where appropriate automatically.
Note that you may need to compile with optimisations enabled for these transformations to trigger; this could explain your initial failure to get blsi generated, since the clang command in the question passes no -O flag.
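To make that concrete, here is a minimal sketch of the three lowest-set-bit operations written as plain C on 64-bit values. The function names are made up for illustration. The expectation that each body becomes a single blsi, blsr, or blsmsk assumes an optimised build (for example -O2) for a target with BMI1 enabled (for example -march=native on a BMI-capable CPU, or -mbmi).

```c
#include <stdint.h>

/* Hypothetical names; each body is the plain-C definition of the
 * corresponding BMI1 operation. */

/* BLSI: isolate the lowest set bit (0 if x == 0). */
uint64_t my_blsi64(uint64_t x)
{
    return x & -x;        /* unsigned negation wraps, so this is well defined */
}

/* BLSR: reset (clear) the lowest set bit. */
uint64_t my_blsr64(uint64_t x)
{
    return x & (x - 1);
}

/* BLSMSK: mask of all bits up to and including the lowest set bit. */
uint64_t my_blsmsk64(uint64_t x)
{
    return x ^ (x - 1);
}
```

For x = 12 (binary 1100) these return 4, 8, and 7 respectively, whichever instructions the compiler ends up choosing.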