
No intrinsics for x86 BMI instructions BLSI & BLSR in Clang?

I wanted to try out some intrinsics for the x86 BMI instruction set. Running grep bmi /proc/cpuinfo shows both bmi1 and bmi2 on my AMD Ryzen CPU, but I cannot get clang to emit some of the instructions, in particular BLSI and BLSR. It looks like they are not supported in clang's bmiintrin.h. Is that really so, or am I missing something? More generally, do you need to install some kind of "plugin" for LLVM from Intel/AMD to use CPU-specific features? Is it better to use their build tools in this case?

Following this article, I built a test program that uses the BLSI or BLSR instructions:

// test_bmi.c
#include <x86intrin.h>
// not #include <bmiintrin.h> - clang errors and asks for x86intrin.h
volatile unsigned long long result;

int main(void) {
  ...
  for (unsigned long long i=0; i<max_count; i++) {
    result = _blsi_u64(i);
  }
}

It's built with -march=native to turn on all of the CPU features:

clang -march=native test_bmi.c -o test_bmi

But there are no blsi-like instructions in the output of objdump -d test_bmi. Looking at the bmiintrin.h source, it seems the BLSI and BLSR instructions are not actually supported:

static __inline__ unsigned long long __DEFAULT_FN_ATTRS
__blsi_u64(unsigned long long __X)
{
  return __X & -__X;
}

But, for example, BEXTR is in the header and it does show up in the objdump assembly:

static __inline__ unsigned long long __DEFAULT_FN_ATTRS
__bextr_u64(unsigned long long __X, unsigned long long __Y)
{
  return __builtin_ia32_bextr_u64(__X, __Y);
}

$ objdump -d test_bmi | grep bextr
    12c5:  c4 e2 f0 f7 c0         bextr  %rcx,%rax,%rax

Does this mean that clang does not really support the BLSI and BLSR instructions? Is that on purpose, or did I miss something needed to enable them?

asked Sep 06 '25 03:09 by xealits


1 Answer

No special intrinsics are needed: clang knows how to use these instructions, and others like them (andn, bextr, blsi, blsmsk, blsr, and even popcnt and tzcnt, among others), if you just write out their behaviour in C.

For example, you can write

int my_blsi(int x)
{
    return (x & -x);
}

and find that the compiler turns this into something like

my_blsi:
    blsil   %edi, %eax
    ret
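The same approach should carry over to the other two instructions of this family. A minimal sketch, assuming the usual bit-trick formulations; the function names my_blsi64, my_blsr and my_blsmsk are made up for illustration, and whether a particular compiler version turns each one into a single instruction is best confirmed in the disassembly:

```c
#include <stdint.h>

/* BLSI: isolate the lowest set bit (0 - x is well-defined for unsigned types). */
uint64_t my_blsi64(uint64_t x)
{
    return x & (0 - x);
}

/* BLSR: reset (clear) the lowest set bit. */
uint64_t my_blsr(uint64_t x)
{
    return x & (x - 1);
}

/* BLSMSK: mask of all bits up to and including the lowest set bit. */
uint64_t my_blsmsk(uint64_t x)
{
    return x ^ (x - 1);
}
```

For x = 12 (binary 1100) these return 4, 8 and 7 respectively.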

This pattern recognition is pretty powerful and can even recognise common implementations of popcnt (both the loop form and the bit-manipulation form), tzcnt, and similar instructions.
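As an illustration of the loop forms meant here, the following is a sketch of two textbook implementations. The names my_popcount and my_tzcnt are hypothetical, and I have not verified that every clang version collapses both loops into single instructions, so treat this as something to try and inspect rather than a guarantee:

```c
#include <stdint.h>

/* Population count: clear the lowest set bit until none remain. */
unsigned my_popcount(uint64_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;   /* same operation as BLSR */
        n++;
    }
    return n;
}

/* Trailing zero count: returns 64 for x == 0, as TZCNT does for a 64-bit operand. */
unsigned my_tzcnt(uint64_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 64;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}
```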

So just write code like you normally would and compile for a target architecture that supports the instructions you are looking for. The compiler will use them where appropriate automatically.

Note that you may need to compile with optimisations enabled (for example -O2) for these transformations to trigger; your build command has no -O flag, which could explain your initial failure to get blsi generated.
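One way to check this is to rewrite the loop from the question without the intrinsic, build it at two optimisation levels, and compare the disassembly. This is only a sketch: run_blsi_loop and the file names are made up, -O2 is an assumed optimisation level, and whether blsi appears depends on your compiler version and CPU.

```c
/* Suggested experiment (adjust names and paths to your setup):
 *   clang -O0 -march=native -c blsi_loop.c -o blsi_O0.o
 *   clang -O2 -march=native -c blsi_loop.c -o blsi_O2.o
 *   objdump -d blsi_O0.o | grep blsi
 *   objdump -d blsi_O2.o | grep blsi
 */

/* volatile keeps the store in the loop from being optimised away. */
volatile unsigned long long result;

/* Stores the lowest set bit of each i in [0, max_count) and returns the last value stored. */
unsigned long long run_blsi_loop(unsigned long long max_count)
{
    for (unsigned long long i = 0; i < max_count; i++) {
        result = i & -i;   /* plain C form of BLSI */
    }
    return result;
}
```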

answered Sep 07 '25 20:09 by fuz