While experimenting on Compiler Explorer to see how the compiler (rustc) handles log2()/leading_zeros() and similar functions, I came across a result which seems both bizarre and concerning:
Compiler Explorer link
Code:
pub fn lzcnt0(val: u64) -> u64 {
    val.leading_zeros() as u64
}

pub unsafe fn lzcnt1(val: u64) -> u64 {
    core::arch::x86_64::_lzcnt_u64(val)
}

pub unsafe fn lzcnt2(val: u64) -> u64 {
    asm_lzcnt(val)
}

#[inline]
pub unsafe fn asm_lzcnt(val: u64) -> u64 {
    let lzcnt: u64;
    core::arch::asm!("lzcnt {}, {}", in(reg) val, lateout(reg) lzcnt, options(nomem, nostack));
    lzcnt
}
Output:
example::lzcnt0:
        test    rdi, rdi
        je      .LBB0_2
        bsr     rax, rdi
        xor     rax, 63
        ret
.LBB0_2:
        mov     eax, 64
        ret

example::lzcnt1:
        jmp     core::core_arch::x86_64::abm::_lzcnt_u64

core::core_arch::x86_64::abm::_lzcnt_u64:
        lzcnt   rax, rdi
        ret

example::lzcnt2:
        lzcnt   rdi, rax
        ret
The compiler options are chosen to best emulate cargo's 'release' configuration (with opt-level=3 for good measure), and otherwise to get the compiler to optimise the functions as far as possible. The specific target shouldn't matter as long as it is x86-64; I've tried x86_64-{pc-windows-{msvc,gnu},unknown-linux-gnu}.
All of these outputs should be identical to lzcnt2. According to the instruction performance tables, lzcnt is evidently a fast instruction across the board and should be used, and having an unnecessary branch in such a low-level function is dismal. What's weirder, the function _lzcnt_u64() calls leading_zeros() under the hood, which the compiler is happy to magic away (there are no checks or asserts either), but it won't seem to do so for the underlying function itself. What's more, the compiler won't inline the lzcnt instruction even in that case (the implementation marks the function as #[inline] too). Sure, a jmp isn't as bad, but it's entirely unnecessary and should be avoided.
I'm seeing similar results in functions like log2 and (I presume) others that rely on the ctlz Rust compiler intrinsic in their implementation.
If you understand compilers sufficiently, any clarification would be greatly appreciated. I don't fancy writing loads of utility functions for little reason, but I'll do so if there's no better alternative.
P.S. If your answer is along the lines that the performance gain is negligible in most situations, and/or that I shouldn't care due to code quality or similar reasoning: I understand the sentiment, but that's not the point of this question. I'm writing bare-metal, hot code in a personal project.
Old x86-64 CPUs don't support lzcnt, so rustc/LLVM won't emit it by default. (They would execute it as bsr, but the behavior is not identical.)
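To see why the default output for lzcnt0 needs a branch, here is a rough Rust model of that fallback sequence. It is only a sketch: ilog2() (stable since Rust 1.67) stands in for what bsr computes, namely the index of the highest set bit, and the name lzcnt_via_bsr is made up for illustration. Because bsr gives no defined result for a zero input, the zero case has to be handled separately, which is what the test/je pair does.

```rust
/// Rough model of the `bsr`-based sequence emitted for `lzcnt0`.
/// `lzcnt_via_bsr` is a hypothetical name used only for this sketch.
pub fn lzcnt_via_bsr(val: u64) -> u64 {
    if val == 0 {
        // `bsr` gives no defined result for zero, hence the `test`/`je` branch.
        return 64;
    }
    // `ilog2` is the index of the highest set bit, which is what `bsr` computes.
    let highest_set_bit = val.ilog2() as u64;
    // For an index in 0..=63, `index ^ 63` equals `63 - index`.
    highest_set_bit ^ 63
}
```

With lzcnt available, the whole function collapses to a single instruction, since lzcnt is defined to return 64 for a zero input.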
Use -C target-feature=+lzcnt to enable it.
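If enabling the feature for the whole crate is not an option, it can also be enabled per function. The following is a minimal sketch, assuming an x86-64 target; the names lzcnt_fast and lzcnt_dispatch are hypothetical. The #[target_feature(enable = "lzcnt")] attribute lets the compiler use lzcnt inside that one function, and is_x86_feature_detected! checks at run time that the CPU supports it before the call is made.

```rust
/// Compiled with `lzcnt` enabled regardless of the crate-wide flags.
/// Hypothetical helper name; callers must ensure the CPU supports `lzcnt`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "lzcnt")]
pub unsafe fn lzcnt_fast(val: u64) -> u64 {
    val.leading_zeros() as u64
}

/// Safe wrapper: uses `lzcnt_fast` when available, the portable path otherwise.
pub fn lzcnt_dispatch(val: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("lzcnt") {
            // SAFETY: the feature was detected on the running CPU just above.
            return unsafe { lzcnt_fast(val) };
        }
    }
    val.leading_zeros() as u64
}
```

Note that is_x86_feature_detected! lives in std, so on bare metal the run-time check is not available and you would have to rely on knowing the target CPU instead.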
More generally, you may wish to use -C target-cpu=XXX to enable all the features of a specific CPU model. Use rustc --print target-cpus for a list.
In particular, -C target-cpu=native will generate code for the CPU that rustc itself is running on, e.g. if you will run the code on the same machine where you are compiling it.
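One side note on the lzcnt2 output in the question: lzcnt rdi, rax writes to rdi and reads rax, so the operands in the asm! string appear to be swapped. asm! uses Intel syntax by default on x86, where the destination comes first, and the first {} in the question's template is bound to the input. A corrected sketch with named operands, assuming an x86-64 CPU that supports lzcnt, could look like this:

```rust
/// Counts leading zeros with a hand-written `lzcnt`.
/// Safety: the caller must ensure the CPU supports `lzcnt`.
#[cfg(target_arch = "x86_64")]
#[inline]
pub unsafe fn asm_lzcnt(val: u64) -> u64 {
    let lzcnt: u64;
    core::arch::asm!(
        // Intel syntax: destination first, then source.
        "lzcnt {out}, {src}",
        src = in(reg) val,
        out = lateout(reg) lzcnt,
        options(nomem, nostack)
    );
    lzcnt
}
```

Naming the operands makes the order in the template string independent of the order in which they are declared, which avoids this kind of mix-up.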