There's only 1 circumstance where __builtin_clz
gives the wrong answer. I'm curious what's causing that behavior.
When I use the literal value 0 I always get 32 as expected. But 0 as a variable yields 31. Why does the method of storing the value 0 matter?
I've taken an architecture class but don't understand the diffed assembly. It looks like when given the literal value 0, the assembly somehow always has the correct answer of 32 hard coded even without optimizations. And the method for counting leading zeros is different when using -march=native.
This post about emulating __builtin_clz
with _BitScanReverse
and the line bsrl %eax, %eax
seem to imply that bit scan reverse doesn't work for 0.
+-------------------+-------------+--------------+
| Compile | literal.cpp | variable.cpp |
+-------------------+-------------+--------------+
| g++ | 32 | 31 |
| g++ -O | 32 | 32 |
| g++ -march=native | 32 | 32 |
+-------------------+-------------+--------------+
#include <iostream>
int main(){
int i = 0;
std::cout << __builtin_clz(0) << std::endl;
}
#include <iostream>
int main(){
int i = 0;
std::cout << __builtin_clz(i) << std::endl;
}
1c1
< .file "literal.cpp"
---
> .file "variable.cpp"
23c23,26
< movl $32, %esi
---
> movl -4(%rbp), %eax
> bsrl %eax, %eax
> xorl $31, %eax
> movl %eax, %esi
1c1
< .file "literal.cpp"
---
> .file "variable.cpp"
23c23,25
< movl $32, %esi
---
> movl -4(%rbp), %eax
> lzcntl %eax, %eax
> movl %eax, %esi
1c1
< .file "literal.cpp"
---
> .file "variable.cpp"
When you compile with optimization disabled, the compiler doesn't do constant-propagation across statements. That part is a duplicate of Why does integer division by -1 (negative one) result in FPE? - read my answer there, and/or Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
This is why a literal zero can be different from a variable with value = 0.
Only the variable with optimization disabled results in runtime bsr+xor $31, %reg
.
As documented in the GCC manual for __builtin_clz
Returns the number of leading 0-bits in x, starting at the most significant bit position. If
x
is 0, the result is undefined.
This allows clz
/ ctz
to compile to 31-bsr
or bsf
instructions respectively, on x86. 31-bsr
is implemented with bsr
+xor $31,%reg
thanks to the magic of 2's complement. (BSR produces the index of the highest set bit, not the leading-zero count).
Note that it only says result, not behaviour. It's not C++ UB (whole program could do absolutely anything), it's limited to that result, just like in x86 asm. But anyway, it seems when the input is a compile-time constant 0, GCC produces the type-width like x86 lzcnt
, and like clz
instructions on other ISAs. (This probably happens in target-independent GIMPLE tree optimization where constant-propagation through operations including builtins is done.)
Intel documents bsf
/bsr
as If the content source operand is 0, the content of the destination operand is undefined. In real life, Intel hardware implements the same behaviour AMD documents: leave the destination unmodified in that case.
But since Intel refuses to document it, compilers won't let you make code which takes advantage of it. GCC does not know or care about that behaviour, and provides no way to take advantage of it. Neither does MSVC even though its intrinsic that takes an output pointer arg so could easily have worked that way. See VS: unexpected optimization behavior with _BitScanReverse64 intrinsic
With -march=native
, GCC can use BMI1 lzcnt
directly, which is well-defined for every possible input bit pattern including 0
. It directly produces a leading-zero count instead of the index of the first set bit.
(This is why BSR/BSF don't make sense for input=0; there is no index for them to find. Fun fact: bsr %eax, %eax
does "work" for eax=0
. In asm, the instructions also set ZF according to whether the input was zero so you can detect when the output is "undefined" instead of a separate test+branch before bsr
. Or on AMD and everything else in real life, that it left the destination unmodified.)
On Intel until Skylake, lzcnt
/ tzcnt
have a false dependency on the output register even though the result does not depend on it ever. IIRC, Coffee Lake also fixed the false dep for popcnt
. (All of which runs on the same execution unit as BSR/BSF.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With