Why does gcc perform tail call optimization for one version but not for the other?

Tags:

Experimenting with tail call optimization (tco), I stumbled upon the following curious example:

unsigned long long int fac1(unsigned long long int n){
  if (n==0)
    return 1;
  return n*fac1(n-1);
}

actually, I was impressed, that gcc was able to perform the tco here (with -O2 flag), because it is not that straight forward:

fac1(unsigned long long):
        testq   %rdi, %rdi
        movl    $1, %eax
        je      .L4
.L3:
        imulq   %rdi, %rax
        subq    $1, %rdi
        jne     .L3
        rep ret
.L4:
        rep ret

However, after the change of the return type from unsigned long long int to unsigned int gcc was not able to perform tlo:

unsigned int fac2(unsigned long long int n){
  if (n==0)
    return 1;
  return n*fac2(n-1);
}

we can clearly see the recursive call in the resulting assembly:

fac2(unsigned long long):
        testq   %rdi, %rdi
        jne     .L16
        movl    $1, %eax
        ret
.L16:
        pushq   %rbx
        movq    %rdi, %rbx
        leaq    -1(%rdi), %rdi
        call    fac2(unsigned long long)
        imull   %ebx, %eax
        popq    %rbx
        ret

At first, I dismissed this as a missed optimization, but now I'm not that sure, because clang isn't able to perform this optimization as well. So maybe there are subtlities of the language I'm not aware of which prevent this optimization.

Why doesn't gcc perform the tail-call-optimization for the function fac2 but only for fac1?

It is clear to me, that in the second version, the result must be downcasted. Obviously this is the only difference. But why should this be a problem and prevent tlo?

For example, if I help the compiler and rewrite my function as a classic tail-recursion (which should produce identical results to version fac2):

unsigned int tlo_fac(unsigned long long int n, unsigned long long int cur){
  if (n==0)
    return cur;
  return tlo_fac(n-1, n*cur);
}

unsigned int fac(unsigned long long int n){
  return tlo_fac(n,1);
}

I get a tlo-optimized version which is identical to fac1 (the high 32bit are allowed to contain garbage, so imulq can be used after inlining):

fac(unsigned long long):
        testq   %rdi, %rdi
        movl    $1, %eax
        je      .L10
.L11:
        imulq   %rdi, %rax
        subq    $1, %rdi
        jne     .L11
.L10:
        rep ret

937

asked Oct 25 '17 11:10

ead

Video Answer

1 Answers

In fact2(), after recursion is completed a cast will be needed from unsigned long long int to unsigned int

unsigned int fac2(unsigned int n) produces the below assembly,

fac2(unsigned int):
        testl   %edi, %edi
        movl    $1, %eax
        je      .L10
.L9:
        imull   %edi, %eax
        subl    $1, %edi
        jne     .L9
        rep ret
.L10:
        rep ret

answered Sep 28 '22 07:09

Gaurav Sehgal

Related questions
                            
                                Placement-new an STL container and destroying it safely afterwards
                            
                                Why is std::function itself a copy-constructible type?
                            
                                captureScreen in cocos-2d C++ - how to use or save the captured image?
                            
                                Boost ASIO: Asynchronous write, synchronously
                            
                                Variadic template issue
                            
                                QList of a polymorphic class with copy-on-write?
                            
                                Matrix multiplication very slow in Eigen
                            
                                minimize byte wasted to align data between 2 headers (custom allocator)
                            
                                How to multicast with ipv6 udp socket in C/C++ on linux?
                            
                                Delegating copy constructor and const data initialization
                            
                                "Include What you use"
                            
                                CMake identifies MSVC (both C and CXX), but still throws 'No CMAKE_*_COMPILER found'
                            
                                How to deal with dynamic allocation when implementing list of objects?
                            
                                Is it possible to use placement-new to change the type of a polymorphic object?
                            
                                CMake custom target name conflicts between subdirectories
                            
                                C++ scoped enums that are implicitly convertable to integers
                            
                                gcc std::regex with -fpack-struct seg faults
                            
                                How to safely translate tree data structures between C++ / Ocaml?
                            
                                Why a new signal for socket::readyRead() is executed, even when its earlier slot is still processing?
                            
                                Using tcmalloc in my C++ project

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why does gcc perform tail call optimization for one version but not for the other?

Tags:

c++

optimization

gcc

tail-call-optimization

ead

People also ask

Video Answer

1 Answers

Gaurav Sehgal

Recent Activity

Donate For Us