Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does gcc perform tail call optimization for one version but not for the other?

Experimenting with tail call optimization (tco), I stumbled upon the following curious example:

unsigned long long int fac1(unsigned long long int n){
  if (n==0)
    return 1;
  return n*fac1(n-1);
}

actually, I was impressed, that gcc was able to perform the tco here (with -O2 flag), because it is not that straight forward:

fac1(unsigned long long):
        testq   %rdi, %rdi
        movl    $1, %eax
        je      .L4
.L3:
        imulq   %rdi, %rax
        subq    $1, %rdi
        jne     .L3
        rep ret
.L4:
        rep ret

However, after the change of the return type from unsigned long long int to unsigned int gcc was not able to perform tlo:

unsigned int fac2(unsigned long long int n){
  if (n==0)
    return 1;
  return n*fac2(n-1);
}

we can clearly see the recursive call in the resulting assembly:

fac2(unsigned long long):
        testq   %rdi, %rdi
        jne     .L16
        movl    $1, %eax
        ret
.L16:
        pushq   %rbx
        movq    %rdi, %rbx
        leaq    -1(%rdi), %rdi
        call    fac2(unsigned long long)
        imull   %ebx, %eax
        popq    %rbx
        ret

At first, I dismissed this as a missed optimization, but now I'm not that sure, because clang isn't able to perform this optimization as well. So maybe there are subtlities of the language I'm not aware of which prevent this optimization.

Why doesn't gcc perform the tail-call-optimization for the function fac2 but only for fac1?


It is clear to me, that in the second version, the result must be downcasted. Obviously this is the only difference. But why should this be a problem and prevent tlo?

For example, if I help the compiler and rewrite my function as a classic tail-recursion (which should produce identical results to version fac2):

unsigned int tlo_fac(unsigned long long int n, unsigned long long int cur){
  if (n==0)
    return cur;
  return tlo_fac(n-1, n*cur);
}

unsigned int fac(unsigned long long int n){
  return tlo_fac(n,1);
}

I get a tlo-optimized version which is identical to fac1 (the high 32bit are allowed to contain garbage, so imulq can be used after inlining):

fac(unsigned long long):
        testq   %rdi, %rdi
        movl    $1, %eax
        je      .L10
.L11:
        imulq   %rdi, %rax
        subq    $1, %rdi
        jne     .L11
.L10:
        rep ret
like image 937
ead Avatar asked Oct 25 '17 11:10

ead


People also ask

Does gcc do tail call optimization?

Regarding functions call optimization, gcc can do tail-call elimination to save the cost of allocating a new stack frame, and tail recursion elimination to turn a recursive function to non-recursive iterative one.

How does tail call optimization work?

Tail call optimization happens when the compiler transforms a call immediately followed by a ret into a single jmp . This transformation saves one instruction, and more importantly it eliminates the implicit push/pop from the stack done by call and ret .

Does V8 support tail call optimization?

Last notes. Tail-call optimization is a part of the ES2015-ES6 specification. Supporting it isn't a NodeJS thing, it's something the V8 engine that NodeJS uses needs to support.

How does compiler optimize tail recursion?

The idea used by compilers to optimize tail-recursive functions is simple, since the recursive call is the last statement, there is nothing left to do in the current function, so saving the current function's stack frame is of no use (See this for more details).


Video Answer


1 Answers

In fact2(), after recursion is completed a cast will be needed from unsigned long long int to unsigned int

unsigned int fac2(unsigned int n) produces the below assembly,

fac2(unsigned int):
        testl   %edi, %edi
        movl    $1, %eax
        je      .L10
.L9:
        imull   %edi, %eax
        subl    $1, %edi
        jne     .L9
        rep ret
.L10:
        rep ret
like image 56
Gaurav Sehgal Avatar answered Sep 28 '22 07:09

Gaurav Sehgal