Why is Boost's QuickSort slower compared to Julia's QuickSort?

Question

I am comparing the performance between Julia and C++. Then I found quick sort is much faster in Julia (and even faster than C++), especially when the size of array is very big.

Can anyone explain the reasons?

quickSort.jl

include("../dimension.jl")
function execute()  
    n = getDimension()
    print(stderr, "Julia,quickSort_optim,$n,");  # use default delimiter
    arr = zeros(Int32, n)
    for i = 1:n
        arr[i] = (777*(i-1)) % 10000
    end
    if n > 0
        sort!(arr; alg=QuickSort)
    end
end

# executing ...
execute()

quickSort_boost.cpp

#include "dimension.h" 
#include <boost/lambda/lambda.hpp>
#include <boost/sort/pdqsort/pdqsort.hpp>
#include <iostream>
#include <iterator>
#include <algorithm>
 
using namespace std;
using namespace boost::sort; 

int main()
{
    int n = getDimension();
    cerr << "C++,quickSort_boost," << n << ",";
    
    vector<int> arr(n);
    unsigned long long w;
    for(int i = 0; i < n; ++i){  // array for sorting
        w = (777*i) % 10000; // Array with values between 0 and 10000
        arr[i] = w;
    }
    
    if (n > 0){
        pdqsort_branchless(arr.begin(), arr.end(), [](const int &a, const int &b){return ( a < b );});
    }
    
    return 0;
}

Comparison

Performance chart plotting size of array vs runtime

Note

The function getDimension() is used to get the array size.

The execution time is measured with the shell command: /usr/bin/time under Ubuntu. Compiler: clang version 6.0.0-1ubuntu2. Optimization level: -02. CPU: Intel i7-3820QM

The reason that I compared the whole execution time rather than just the algorithm itself is because I want to compare the performance between these two languages, which simulates a real application scenario.

In Julia official document, it writes: QuickSort: good performance for large collections. Is this because Julia uses a special implementation inside the algorithm.

More samples

I run the test with more samples. It seems that the distribution of the data is the issue.

best case with broadly distributed data:

function execute()  # julia code segment for specifying data
    for i = 1:n
        arr[i] = i
    end

for(int i = 0; i < n; ++i){  // c++ boost code segment for specifying data
    arr[i] =  i + 1; 
}

QuickSort-best case with broadly distruted data

worst case with broadly distributed data:

function execute()  # julia code segment for specifying data
    for i = 1:n
        arr[i] = n - i + 1
    end

for(int i = 0; i < n; ++i){  // c++ boost code segment for specifying data
    arr[i] =  n - i;  
}

QuickSort-worst case with broadly distruted data

Concentrated distributed data

function execute()  # julia code segment for specifying data
    for i = 1:n
        arr[i] = i % 10
    end

for(int i = 0; i < n; ++i){  // c++ boost code segment for specifying data
    arr[i] =  (i + 1) % 10;  
}

Concentrated distributed data

My source code

StefanKarpinski · Accepted Answer

Not sure what's up with the timings – you don't include enough code to test or reproduce. The Julia code for QuickSort is pretty simple, you can see the source for it here:

https://github.com/JuliaLang/julia/blob/77487611fd9751c0f31ac3853072e6ca11efaf05/base/sort.jl#L518-L579

I'll include the code for easy reading here as well:

@inline function selectpivot!(v::AbstractVector, lo::Integer, hi::Integer, o::Ordering)
    @inbounds begin
        mi = midpoint(lo, hi)

        # sort v[mi] <= v[lo] <= v[hi] such that the pivot is immediately in place
        if lt(o, v[lo], v[mi])
            v[mi], v[lo] = v[lo], v[mi]
        end

        if lt(o, v[hi], v[lo])
            if lt(o, v[hi], v[mi])
                v[hi], v[lo], v[mi] = v[lo], v[mi], v[hi]
            else
                v[hi], v[lo] = v[lo], v[hi]
            end
        end

        # return the pivot
        return v[lo]
    end
end

function partition!(v::AbstractVector, lo::Integer, hi::Integer, o::Ordering)
    pivot = selectpivot!(v, lo, hi, o)
    # pivot == v[lo], v[hi] > pivot
    i, j = lo, hi
    @inbounds while true
        i += 1; j -= 1
        while lt(o, v[i], pivot); i += 1; end;
        while lt(o, pivot, v[j]); j -= 1; end;
        i >= j && break
        v[i], v[j] = v[j], v[i]
    end
    v[j], v[lo] = pivot, v[j]

    # v[j] == pivot
    # v[k] >= pivot for k > j
    # v[i] <= pivot for i < j
    return j
end

function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::QuickSortAlg, o::Ordering)
    @inbounds while lo < hi
        hi-lo <= SMALL_THRESHOLD && return sort!(v, lo, hi, SMALL_ALGORITHM, o)
        j = partition!(v, lo, hi, o)
        if j-lo < hi-j
            # recurse on the smaller chunk
            # this is necessary to preserve O(log(n))
            # stack space in the worst case (rather than O(n))
            lo < (j-1) && sort!(v, lo, j-1, a, o)
            lo = j+1
        else
            j+1 < hi && sort!(v, j+1, hi, a, o)
            hi = j-1
        end
    end
    return v
end

Nothing special going on there, just a basic well-optimized quicksort. So the real question is why the C++/Boost version is slow, which I'm not in a good position to answer. Perhaps one of the many C++ experts hanging around here can address that.

It seems like it's possible that C++ does not choose pivots as well as Julia does. Pivot selection is a bit of a balancing act: on the one hand, if you pick bad pivots the asymptotic behavior can become really bad; on the other hand, if you spend too much time and effort picking pivots, it can slow the whole sort down considerably.

Another possibility is that Boost's quicksort implementation doesn't use a different algorithm as a base case once the array size gets smaller. That is pretty key for good performance since quicksort doesn't perform well on small arrays. The Julia version switches to insertion sort for arrays of less than 20 elements.

Unapiedra · Answer

Summary

TL;DR Julia and C++'s std::sort are extremely close in performance. Boost's pdqsort_branchless is faster.

Key Take-Aways

Methodology:

Only measure what is important. In your code you are also measuring set-up time to generate the vectors.
Be careful that the distribution of your test data is comparable to the distribution of your "production" data.
Make sure you compile with -march=native -mtune=native ref

Code:

Boost's pdqsort_branchless is faster. The STL's std::sort and Julia's sort are reasonably close together in runtime performance.
I also tested C's qsort which is slower. That's why I don't show it any further.
I also used the BenchmarkTools Julia package to benchmark Julia's sort.

enter image description here

Plot explanation:

I use this script for plotting.
If labels are too small:
- blue/fastest: boost's sort
- green and red: std::sort and Julia.

Benchmarking

C++

On C++, use the benchmark library.

For your convenience, you can access it online at Quick-Bench. Here is this code: https://quick-bench.com/q/CVu8y1fjwh19AHJH1iUCogxG6Pk

This lets you specify which part of your code you want to measure. It will also do the measuring and the statistics correctly.

Here's an example:

static void Example(benchmark::State &state) {
  std::vector<int> data(1024);
  std::iota(data.begin(), data.end(), 0);

  std::mt19937 mersenne_engine{1234};

  for (auto _ : state) {
    state.PauseTiming();
    std::shuffle(data.begin(), data.end(), mersenne_engine);
    state.ResumeTiming();

    std::sort(data.begin(), data.end());
  }
}
BENCHMARK(Example);

Google Benchmark only measures code inside the for (auto _ : state) {...} block (the timed block). So, the code above avoids the pitfall of measuring setup time.

Before each run, the data is shuffled.

Julia

Use the BenchmarkTools library, where the benchmark is done like this:

Random.seed!(1234)
function make_vec(n)
    return Random.randperm(Int32(n));
end

@benchmark sort!(arr; alg=QuickSort) setup=(arr=make_vec(1024));

Data Generation

We need to clarify the way you generate data.

As you discover, the algorithms work fastest on already sorted data and worse on reverse-sorted data.

In the end, the domain knowledge of where the benchmarked algorithm is run needs to be known. What is the shape of the data distribution?

Random Data

First, let's just have uniformly distributed data in [0, 10 000]. Much better than testing on already sorted data.

I am using a fixed seed, so the data will be the same every time (for a given N). If you instantiate mersenne_engine with rnd_device(), you get a different seed every time, and thus differing data. Personally, I favor going with random but consistent data, to make the result less variable. Just keep it in mind for your later analysis.

Note that generating data between 0 and 10k will lead to many duplicates on inputs order of magnitudes greater than 10k.

Shuffled Data

If you want don't want to have duplicates, then generate a vector with elements [0, 1, 2, ..., N] and then shuffle the data.

Here's how I'd do it:

  // Create vector filled with 0..n
  std::vector<int> arr(n);
  std::iota(arr.begin(), arr.end(), 0);
  std::shuffle(std::begin(arr), std::end(arr), mersenne_engine);
}

Julia:

function make_vec(n)
    return Random.randperm(n);
end

Why is Boost's QuickSort slower compared to Julia's QuickSort?

Tags:

c++

performance

benchmarking

quicksort

julia

David

2 Answers

StefanKarpinski

Summary

Key Take-Aways

Benchmarking

C++

Julia

Data Generation

Random Data

Shuffled Data

Unapiedra

Recent Activity

Donate For Us

Why is Boost's QuickSort slower compared to Julia's QuickSort?

Tags:

c++

performance

benchmarking

quicksort

julia

David

2 Answers

StefanKarpinski

Summary

Key Take-Aways

Benchmarking

C++

Julia

Data Generation

Random Data

Shuffled Data

Unapiedra

Related questions

Recent Activity

Donate For Us