I encountered an unexpectedly early stack overflow and created the following program to test the issue:
#![feature(asm)]

#[inline(never)]
fn get_rsp() -> usize {
    let rsp: usize;
    unsafe {
        asm! {
            "mov {}, rsp",
            out(reg) rsp
        }
    }
    rsp
}

fn useless_function(x: usize) {
    if x > 0 {
        println!("{:x}", get_rsp());
        useless_function(x - 1);
    }
}

fn main() {
    useless_function(10);
}
This is get_rsp disassembled (according to cargo-asm):
tests::get_rsp:
push rax
#APP
mov rax, rsp
#NO_APP
pop rcx
ret
I'm not sure what #APP and #NO_APP do or why rax is pushed and then popped into rcx, but it seems the function does return the stack pointer.
I was surprised to find that in debug mode, the difference between two consecutively printed rsp values was 192(!) bytes, and even in release mode it was 128.
As far as I understand, all that needs to be stored for each call to useless_function is one usize and a return address, so I'd expect every stack frame to be around 16 bytes large.
I'm running this with rustc 1.46.0 on a 64-bit Windows machine.
Are my results consistent across machines? How is this explained?
It seems that the use of println! has a pretty significant effect. In an attempt to avoid that, I changed the program (thanks to @Shepmaster for the idea) to store the values in a static array:
static mut RSPS: [usize; 10] = [0; 10];

#[inline(never)]
fn useless_function(x: usize) {
    unsafe { RSPS[x] = get_rsp() };
    if x == 0 {
        return;
    }
    useless_function(x - 1);
}

fn main() {
    useless_function(9);
    println!("{:?}", unsafe { RSPS });
}
The recursion gets optimised away in release mode, but in debug mode each frame still takes 80 bytes which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better? This seems a little inefficient.
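For anyone who wants to repeat the measurement on stable Rust without inline assembly, one option is to record the address of a local variable at each recursion depth instead of reading rsp directly. The following is only a sketch under assumptions: the names record_depth and frame_strides are hypothetical and not part of the program above, it relies on std::hint::black_box (stable since Rust 1.66) to discourage the optimizer from removing the local or the recursion, and the numbers it prints depend on the compiler version, target, and optimization level.

```rust
use std::hint::black_box;

/// Records the address of a local variable at each recursion depth.
/// `record_depth` is a hypothetical helper, not from the original program.
#[inline(never)]
fn record_depth(depth: usize, out: &mut Vec<usize>) {
    let marker = 0u8;
    // black_box discourages the optimizer from eliminating `marker`.
    out.push(black_box(&marker) as *const u8 as usize);
    if depth > 0 {
        record_depth(depth - 1, out);
    }
    // Use `marker` after the call so the recursive call is not in tail position.
    black_box(&marker);
}

/// Absolute distance in bytes between consecutive recorded addresses.
fn frame_strides(addrs: &[usize]) -> Vec<usize> {
    addrs.windows(2).map(|w| w[0].abs_diff(w[1])).collect()
}

fn main() {
    let mut addrs = Vec::new();
    record_depth(9, &mut addrs);
    // One stride per pair of adjacent frames.
    println!("{:?}", frame_strides(&addrs));
}
```

The address of a local is not the same value as rsp, but the distance between the same local in two adjacent frames of the same function should match the distance between the two stack pointers.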
Using formatting machinery like println! creates a number of things on the stack. Expanding the macros used in your code:
fn useless_function(x: usize) {
    if x > 0 {
        {
            ::std::io::_print(::core::fmt::Arguments::new_v1(
                &["", "\n"],
                &match (&get_rsp(),) {
                    (arg0,) => [::core::fmt::ArgumentV1::new(
                        arg0,
                        ::core::fmt::LowerHex::fmt,
                    )],
                },
            ));
        };
        useless_function(x - 1);
    }
}
I believe that those structs consume the majority of the space. As an attempt to prove that, I printed the size of the value created by format_args, which is used by println!:
let sz = std::mem::size_of_val(&format_args!("{:x}", get_rsp()));
println!("{}", sz);
This shows that it is 48 bytes.
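As a complete program, the same check might look like the sketch below. It prints the size of std::fmt::Arguments both by type and by value, next to the pointer size for comparison. The helper name arguments_size is hypothetical, and I would not rely on the exact number: the layout of fmt::Arguments is an implementation detail, so the result can differ between compiler versions and targets.

```rust
use std::fmt::Arguments;
use std::mem::{size_of, size_of_val};

/// Size in bytes of `fmt::Arguments` on the current compiler and target.
/// `arguments_size` is a hypothetical helper for illustration.
fn arguments_size() -> usize {
    size_of::<Arguments<'static>>()
}

fn main() {
    let value = 0x1234usize;
    // Same measurement as above, taken from a concrete format_args! value.
    let by_value = size_of_val(&format_args!("{:x}", value));
    println!("size_of::<fmt::Arguments>()    = {}", arguments_size());
    println!("size_of_val(&format_args!(..)) = {}", by_value);
    println!("pointer size                   = {}", size_of::<usize>());
}
```

Keep in mind that this is only the Arguments struct itself. The arrays of string pieces and argument references it points to are separate temporaries that also take space in the caller's frame.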
Something like this should remove the printing from the equation, but in release mode the optimizer eliminates the recursion anyway despite the inline(never) hint (the assembly further down shows a loop), resulting in the sequential values all being the same.
/// SAFETY:
/// The length of `rsp` and the value of `x` must always match
#[inline(never)]
unsafe fn useless_function(x: usize, rsp: &mut [usize]) {
    if x > 0 {
        *rsp.get_unchecked_mut(0) = get_rsp();
        useless_function(x - 1, rsp.get_unchecked_mut(1..));
    }
}

fn main() {
    unsafe {
        let mut rsp = [0; 10];
        useless_function(rsp.len(), &mut rsp);
        for w in rsp.windows(2) {
            println!("{}", w[0] - w[1]);
        }
    }
}
That said, you can make the function public and look at its assembly anyway (lightly cleaned):
playground::useless_function:
pushq %r15
pushq %r14
pushq %rbx
testq %rdi, %rdi
je .LBB6_3
movq %rsi, %r14
movq %rdi, %r15
xorl %ebx, %ebx
.LBB6_2:
callq playground::get_rsp
movq %rax, (%r14,%rbx,8)
addq $1, %rbx
cmpq %rbx, %r15
jne .LBB6_2
.LBB6_3:
popq %rbx
popq %r14
popq %r15
retq
Regarding "but in debug mode each frame still takes 80 bytes":
Compare the unoptimized assembly:
playground::useless_function:
subq $104, %rsp
movq %rdi, 80(%rsp)
movq %rsi, 88(%rsp)
movq %rdx, 96(%rsp)
cmpq $0, %rdi
movq %rdi, 56(%rsp) # 8-byte Spill
movq %rsi, 48(%rsp) # 8-byte Spill
movq %rdx, 40(%rsp) # 8-byte Spill
ja .LBB44_2
jmp .LBB44_8
.LBB44_2:
callq playground::get_rsp
movq %rax, 32(%rsp) # 8-byte Spill
xorl %eax, %eax
movl %eax, %edx
movq 48(%rsp), %rdi # 8-byte Reload
movq 40(%rsp), %rsi # 8-byte Reload
callq core::slice::<impl [T]>::get_unchecked_mut
movq %rax, 24(%rsp) # 8-byte Spill
movq 24(%rsp), %rax # 8-byte Reload
movq 32(%rsp), %rcx # 8-byte Reload
movq %rcx, (%rax)
movq 56(%rsp), %rdx # 8-byte Reload
subq $1, %rdx
setb %sil
testb $1, %sil
movq %rdx, 16(%rsp) # 8-byte Spill
jne .LBB44_9
movq $1, 72(%rsp)
movq 72(%rsp), %rdx
movq 48(%rsp), %rdi # 8-byte Reload
movq 40(%rsp), %rsi # 8-byte Reload
callq core::slice::<impl [T]>::get_unchecked_mut
movq %rax, 8(%rsp) # 8-byte Spill
movq %rdx, (%rsp) # 8-byte Spill
movq 16(%rsp), %rdi # 8-byte Reload
movq 8(%rsp), %rsi # 8-byte Reload
movq (%rsp), %rdx # 8-byte Reload
callq playground::useless_function
jmp .LBB44_8
.LBB44_8:
addq $104, %rsp
retq
.LBB44_9:
leaq str.0(%rip), %rdi
leaq .L__unnamed_7(%rip), %rdx
movq core::panicking::panic@GOTPCREL(%rip), %rax
movl $33, %esi
callq *%rax
ud2
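To connect these listings to the numbers you measured, the per-call stack usage can be read off the prologue: the bytes subtracted from rsp, plus 8 for every pushed register, plus 8 for the return address that call pushes. Here is that arithmetic as a small sketch; frame_stride is a hypothetical helper, and the inputs are taken from the two listings above (which were compiled for a different target than your Windows machine, so they will not reproduce your 80 bytes exactly).

```rust
/// Bytes of stack used per call on x86-64, given the explicit `sub` from rsp
/// in the prologue and the number of registers pushed.
/// `frame_stride` is a hypothetical helper for illustration only.
const fn frame_stride(sub_rsp_bytes: usize, pushed_regs: usize) -> usize {
    const WORD: usize = 8;
    // The trailing WORD is the return address pushed by `call`.
    sub_rsp_bytes + pushed_regs * WORD + WORD
}

fn main() {
    // Unoptimized listing: `subq $104, %rsp`, no register pushes.
    let debug = frame_stride(104, 0);
    // Optimized listing: pushes r15, r14 and rbx, no `sub`.
    let release = frame_stride(0, 3);
    println!("debug: {} bytes, release: {} bytes", debug, release);
    // Both are multiples of 16, matching the usual stack alignment at calls.
    println!("16-byte aligned: {}", debug % 16 == 0 && release % 16 == 0);
}
```

For the unoptimized listing this gives 112 bytes per recursive call, almost all of it spill slots for values that the optimized build keeps in registers. The optimized function uses 32 bytes, but only once, since it no longer recurses.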