I'm a bit confused about both instructions. First let's discard the special case when the scanned value is 0 and the undefined/bsr or bitsize/lzcnt result - this difference is clear and not part of my question.
Let's take the binary value 0001 1111 1111 1111 1111 1111 1111 1111
According to Intel's spec the result for lzcnt
is 3
According to Intel's spec the result for bsr
is 28
lzcnt
counts, bsr
returns the index or distance from bit 0 (which is the lsb).
How can both instructions be the same and how can lzcnt
be emulated as bsr
in case there's no BMI on the CPU available? Or is bit 0 in case of bsr
the msb? Both "code operations" in Intel's spec are different too, one counts or indexes from the left, the other from the right.
Maybe someone can shed some light on this, I have no CPU without BMI/lzcnt
instruction to test if the fallback to bsr
works with the same result (as the special case of value 0 to scan never happens).
LZCNT
gives the number of leading zero bits. BSR
gives the bit index of the most significant 1 bit. So they do effectively the same thing for the non-zero case, except the result is interpreted differently. Therefore you can just subtract the BSR
result from 31 to get the same behaviour as with LZCNT
, i.e. LZCNT == (31 - BSR)
.
To be clear, there is no working fallback from lzcnt
to bsr
. What happened is that Intel used the previously redundant sequence rep bsr
to encode the new lzcnt
instruction. Using a redudant rep
prefix for bsr
was generally defined to be ignored, but with the caveat that it may decode differently on future CPUs1.
So if you happen to execute lzcnt
on a CPU that doesn't support it, it will execute as bsr
. Of course, this fallback is not exactly intentional, and it gives the wrong result (as Paul R points out they look at the same bit but report it differently): it is just a consequence of the way the new instruction was encoded and how pointless rep
prefixes were treated by prior CPUs. So the world fallback is pretty much entirely inappropriate for lzcnt
and bsr
.
The situation is more subtle for the case of tzcnt
and bsf
. It uses the same encoding trick: tzcnt
has the same encoding as rep bsf
, but here the "fallback" mostly works since tzcnt
returns the same value as bsf
for all inputs except zero. For zero inputs tzcnt
returns 32, but bsf leaves the destination undefined.
You can't really use even this fallback though: if you never have zero inputs you might as well just use bsf
, saving a byte and being compatible with a couple decades of CPUs, and if you do have zero inputs the behavior differs.
So the behavior is perhaps better classified as trivia than a fallback...
1 Normally this would more or less be esoterica, but you could for example use rep
prefixes are where they have no functional effect to lengthen instructions to help align subsequent code without inserting explicit nop
instructions. Given the "may decode differently in the future" this would be dangerous when compiling code which may run on any future CPU.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With