Tail call optimisation seems to slightly worsen performance

In a quicksort implementation, the data on the left is for code compiled with plain -O2, and the data on the right is for -O2 code with the -fno-optimize-sibling-calls flag turned on, i.e. with tail-call optimisation turned off. Each value is the average of 3 runs; the variation seemed negligible. Array values were in the range 1-1000, and times are in milliseconds. The compiler was MinGW g++, version 6.3.0.

size of array     with TCO (ms)    without TCO (ms)
      8M                 35,083             34,051
      4M                  8,952              8,627
      1M                    613                609

Below is my code:

#include <bits/stdc++.h>
using namespace std;

int N = 4000000;

void qsort(int* arr,int start=0,int finish=N-1){
    if(start>=finish) return ;
    int i=start+1,j = finish,temp;
    auto pivot = arr[start];
    while(i!=j){
        while (arr[j]>=pivot && j>i) --j;
        while (arr[i]<pivot && i<j) ++i;
        if(i==j) break;
        temp=arr[i];arr[i]=arr[j];arr[j]=temp; //swap big guy to right side
    }
    if(arr[i]>=arr[start]) --i;

    temp = arr[start];arr[start]=arr[i];arr[i]=temp; //swap pivot
    qsort(arr,start,i-1);
    qsort(arr,i+1,finish);
}

int main(){
    srand(time(NULL));
    int* arr = new int[N];
    for(int i=0;i<N;i++) {arr[i] = rand()%1000+1;}

    auto start = clock();
    qsort(arr);
    cout<<(clock()-start)<<endl;
    return 0;
}

I heard clock() isn't the perfect way to measure time. But this effect seems to be consistent.
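
For reference, here is a sketch of a replacement for main in the code above that measures the time with std::chrono::steady_clock instead of clock(); the numbers above were measured with the clock() version, not with this one:

//sketch: drop-in replacement for main in the code above
#include <chrono>   // in addition to the includes above

int main(){
    srand(time(NULL));
    int* arr = new int[N];
    for(int i=0;i<N;i++) {arr[i] = rand()%1000+1;}

    auto t0 = std::chrono::steady_clock::now();
    qsort(arr);
    auto t1 = std::chrono::steady_clock::now();
    cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
         << " ms" << endl;

    delete[] arr;
    return 0;
}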

EDIT: In response to a comment, I guess my question is: how exactly does gcc's tail-call optimizer work, why did this happen, and what should I do to leverage tail calls to speed up my program?

Asked Oct 18 '17 by Shihab Shahriar Khan


1 Answer

On speed:

As already pointed out in the comments, the primary goal of tail-call-optimization is to reduce the usage of the stack.

However, there is often a side effect: the program becomes faster because the overhead of a function call disappears. This gain is most prominent when the work done inside the function itself is small, so the call overhead carries some weight.

If there is a lot of work done during a function call, the overhead can be neglected and there is no noticeable speed-up.

On the other hand, performing tail-call optimization can prevent other optimizations that might otherwise speed up your code.

The case of your quicksort is not that clear-cut: there are some calls with a large workload and a lot of calls with a very small workload.

So, for 1M elements the disadvantages of tail-call optimization outweigh the advantages. On my machine, the tail-call-optimized function becomes faster than the non-optimized function only for arrays smaller than 50,000 elements.

I must confess that I cannot say why this is the case just from looking at the assembly. All I can tell is that the resulting assemblies are quite different and that the recursive quicksort call really appears only once in the optimized version.

Here is a clear-cut example for which tail-call optimization is much faster (because not much happens in the function itself, so the call overhead is noticeable):

//fib.cpp
#include <iostream>

unsigned long long int fib(unsigned long long int n){
  if (n==0 || n==1)
    return 1;
  return fib(n-1)+fib(n-2);
}

int main(){
  unsigned long long int N;
  std::cin >> N;
  std::cout << fib(N);
}

Running time echo "40" | ./fib, I get 1.1 vs. 1.6 seconds for the tail-call-optimized version vs. the non-optimized version. Actually, I'm pretty impressed that the compiler is able to use tail-call optimization here - but it really does, as can be seen at godbolt.org - the second call of fib is optimized.
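
To illustrate what "the second call is optimized" means, here is my own rough, hand-written approximation of the transformation (an assumption about its effect, not gcc's actual output): the call fib(n-2) is folded into a loop with an accumulator, while fib(n-1) stays a genuine recursive call.

//fib_sketch.cpp - hand-written approximation, not gcc's output
#include <iostream>

unsigned long long int fib_sketch(unsigned long long int n){
  unsigned long long int acc = 0;   // sum of the results of the eliminated calls
  while (true){
    if (n==0 || n==1)
      return acc + 1;               // base case plus everything accumulated so far
    acc += fib_sketch(n-1);         // first call: stays a real recursive call
    n -= 2;                         // second call fib(n-2): just another loop iteration
  }
}

int main(){
  unsigned long long int N;
  std::cin >> N;
  std::cout << fib_sketch(N);
}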


On tail call optimization:

Usually, tail-call optimization can be done if the recursive call is the last operation (prior to return) in the function - the variables on the stack can be reused for the next call, i.e. the function should be of the form

ResType f( InputType input){
    //do work
    InputType new_input = ...;
    return f(new_input);
}

There are some languages which don't do tail-call optimization at all (e.g. Python) and some for which you can explicitly ask the compiler to do it, and the compiler will fail if it is not able to (e.g. Clojure). C++ goes a way in between: the compiler tries its best (which is amazingly good!), but you have no guarantee that it will succeed, and if it doesn't, it silently falls back to a version without tail-call optimization.

Let's take a look at this simple and standard implementation of tail-call recursion:

//should be called fac(n,1)
unsigned long long int 
fac(unsigned long long int n, unsigned long long int res_so_far){
  if (n==0)
    return res_so_far;
  return fac(n-1, res_so_far*n);
}

This classical form of tail call makes it easy for the compiler to optimize: see the result here - no recursive call to fac!

However, gcc is able to perform TCO also in less obvious cases:

unsigned long long int 
fac(unsigned long long int n){
  if (n==0)
    return 1;
  return n*fac(n-1);
}

It is easier for us humans to read and write, but harder for the compiler to optimize (fun fact: TCO is not performed if the return type is int instead of unsigned long long int): after all, the result of the recursive call is used for a further calculation (the multiplication) before it is returned. But gcc manages to perform TCO here as well!
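
For illustration, here is my own sketch of what the optimized version roughly amounts to (again an approximation, not gcc's literal output): the deferred multiplications are collected in an accumulator, so no recursive call remains.

//fac_sketch.cpp - hand-written approximation, not gcc's output
unsigned long long int fac_sketch(unsigned long long int n){
  unsigned long long int acc = 1;   // the base case "return 1" folded into the start value
  while (n != 0){
    acc *= n;                       // the pending "n * ..." multiplication, applied eagerly
    --n;
  }
  return acc;
}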

With this fac function, we can see the result of TCO at work:

//factorial.cpp
#include <iostream>

unsigned long long int 
fac(unsigned long long int n){
  if (n==0)
    return 1;
  return n*fac(n-1);
}

int main(){
  unsigned long long int N;
  std::cin >> N;
  std::cout << fac(N);
}

Running time echo "40000000" | ./factorial will get you the result (0) in no time if tail-call optimization was performed, or a "Segmentation fault" otherwise - a stack overflow due to the recursion depth.

Actually, this is a simple test to see whether tail-call optimization was performed or not: a "Segmentation fault" for the non-optimized version at large recursion depth.


Corollary:

As already pointed out in the comments: only the second call of the quicksort is optimized via TCO. In your implementation, if you are unlucky and the second half of the array always consists of only one element, you will need O(n) space on the stack.

However, if the first call is always made with the smaller half and the second, tail-call-optimized call with the larger half, you need at most O(log n) recursion depth and thus only O(log n) space on the stack.

That means you should check which part of the array you call quicksort on first, as it plays a huge role.
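
Here is a sketch of that modification, based on the qsort from the question (my own variant and naming, not code from the question, and not benchmarked): the genuinely recursive first call always gets the smaller part, and the larger part is left to the final call, which the compiler may turn into a jump.

//sketch: recurse into the smaller part first, tail-call into the larger part
void qsort2(int* arr, int start, int finish){
    if(start >= finish) return;
    int i = start+1, j = finish, temp;
    auto pivot = arr[start];
    while(i != j){                                   // same partitioning as in the question
        while (arr[j] >= pivot && j > i) --j;
        while (arr[i] <  pivot && i < j) ++i;
        if(i == j) break;
        temp = arr[i]; arr[i] = arr[j]; arr[j] = temp;
    }
    if(arr[i] >= arr[start]) --i;
    temp = arr[start]; arr[start] = arr[i]; arr[i] = temp;  // place the pivot

    if(i - start < finish - i){                      // left part is the smaller one
        qsort2(arr, start, i-1);                     // real recursive call on the smaller part
        qsort2(arr, i+1, finish);                    // tail call on the larger part (TCO candidate)
    } else {
        qsort2(arr, i+1, finish);                    // real recursive call on the smaller part
        qsort2(arr, start, i-1);                     // tail call on the larger part (TCO candidate)
    }
}

With TCO applied to the last call, the stack depth stays within O(log n); without TCO, the ordering alone does not give that guarantee.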

Answered Oct 23 '22 by ead