How can a literal 0 and 0 as a variable yield different behavior with the function __builtin_clz?

Question

There's only 1 circumstance where __builtin_clz gives the wrong answer. I'm curious what's causing that behavior.

When I use the literal value 0 I always get 32 as expected. But 0 as a variable yields 31. Why does the method of storing the value 0 matter?

I've taken an architecture class but don't understand the diffed assembly. It looks like when given the literal value 0, the assembly somehow always has the correct answer of 32 hard coded even without optimizations. And the method for counting leading zeros is different when using -march=native.

This post about emulating __builtin_clz with _BitScanReverse and the line bsrl %eax, %eax seem to imply that bit scan reverse doesn't work for 0.

+-------------------+-------------+--------------+
|      Compile      | literal.cpp | variable.cpp |
+-------------------+-------------+--------------+
| g++               |          32 |           31 |
| g++ -O            |          32 |           32 |
| g++ -march=native |          32 |           32 |
+-------------------+-------------+--------------+

literal.cpp

#include <iostream>

int main(){
    int i = 0;
    std::cout << __builtin_clz(0) << std::endl;
}

variable.cpp

#include <iostream>

int main(){
    int i = 0;
    std::cout << __builtin_clz(i) << std::endl;
}

Diff for g++ -S [in name] -o [out name]

1c1
<       .file   "literal.cpp"
---
>       .file   "variable.cpp"
23c23,26
<       movl    $32, %esi
---
>       movl    -4(%rbp), %eax
>       bsrl    %eax, %eax
>       xorl    $31, %eax
>       movl    %eax, %esi

Diff for g++ -march=native -S [in name] -o [out name]

1c1
<       .file   "literal.cpp"
---
>       .file   "variable.cpp"
23c23,25
<       movl    $32, %esi
---
>       movl    -4(%rbp), %eax
>       lzcntl  %eax, %eax
>       movl    %eax, %esi

Diff for g++ -O -S [in name] -o [out name]

1c1
<       .file   "literal.cpp"
---
>       .file   "variable.cpp"

Peter Cordes · Accepted Answer

When you compile with optimization disabled, the compiler doesn't do constant-propagation across statements. That part is a duplicate of Why does integer division by -1 (negative one) result in FPE? - read my answer there, and/or Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?

This is why a literal zero can be different from a variable with value = 0. Only the variable with optimization disabled results in runtime bsr+xor $31, %reg.

As documented in the GCC manual for __builtin_clz

Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

This allows clz / ctz to compile to 31-bsr or bsf instructions respectively, on x86. 31-bsr is implemented with bsr+xor $31,%reg thanks to the magic of 2's complement. (BSR produces the index of the highest set bit, not the leading-zero count).

Note that it only says result, not behaviour. It's not C++ UB (whole program could do absolutely anything), it's limited to that result, just like in x86 asm. But anyway, it seems when the input is a compile-time constant 0, GCC produces the type-width like x86 lzcnt, and like clz instructions on other ISAs. (This probably happens in target-independent GIMPLE tree optimization where constant-propagation through operations including builtins is done.)

Intel documents bsf/bsr as If the content source operand is 0, the content of the destination operand is undefined. In real life, Intel hardware implements the same behaviour AMD documents: leave the destination unmodified in that case.

But since Intel refuses to document it, compilers won't let you make code which takes advantage of it. GCC does not know or care about that behaviour, and provides no way to take advantage of it. Neither does MSVC even though its intrinsic that takes an output pointer arg so could easily have worked that way. See VS: unexpected optimization behavior with _BitScanReverse64 intrinsic

With -march=native, GCC can use BMI1 lzcnt directly, which is well-defined for every possible input bit pattern including 0. It directly produces a leading-zero count instead of the index of the first set bit.

(This is why BSR/BSF don't make sense for input=0; there is no index for them to find. Fun fact: bsr %eax, %eax does "work" for eax=0. In asm, the instructions also set ZF according to whether the input was zero so you can detect when the output is "undefined" instead of a separate test+branch before bsr. Or on AMD and everything else in real life, that it left the destination unmodified.)

How can a literal 0 and 0 as a variable yield different behavior with the function __builtin_clz?

Tags:

c++

gcc

assembly

undefined-behavior

intrinsics

literal.cpp

variable.cpp

Diff for g++ -S [in name] -o [out name]

Diff for g++ -march=native -S [in name] -o [out name]

Diff for g++ -O -S [in name] -o [out name]

Matt

1 Answers

Further reading about bsf vs. lzcnt and false dependencies

Peter Cordes

Recent Activity

Donate For Us

How can a literal 0 and 0 as a variable yield different behavior with the function __builtin_clz?

Tags:

c++

gcc

assembly

undefined-behavior

intrinsics

literal.cpp

variable.cpp

Diff for g++ -S [in name] -o [out name]

Diff for g++ -march=native -S [in name] -o [out name]

Diff for g++ -O -S [in name] -o [out name]

Matt

1 Answers

Further reading about bsf vs. lzcnt and false dependencies

Peter Cordes

Related questions

Recent Activity

Donate For Us