
Implement quicksort on bi-directional iterators

It seems quite straightforward to implement quicksort using bidirectional iterators in O(N lg N) time and O(lg N) space. So, what is the particular reason that std::sort() requires random-access iterators?

I have read the question "Why do std::sort and partial_sort require random-access iterators?", but it doesn't explain which specific part of a possible std::sort() implementation actually requires a random-access iterator to maintain its time and space complexity.

A possible implementation in O(N lg N) time and O(lg N) space:

#include <algorithm>   // std::iter_swap
#include <functional>  // std::less
#include <iterator>    // std::iterator_traits, std::prev
#include <stack>       // std::stack
#include <tuple>       // std::tie
#include <utility>     // std::pair

// Hoare-style partition: moves elements satisfying pred to the front and
// returns the start of the second group.
template <typename BidirIt, typename Pred>
BidirIt partition(BidirIt first, BidirIt last, Pred pred) {
  while (true) {
    // advance first past elements already in the first group
    while (true) {
      if (first == last) return first;
      if (!pred(*first)) break;
      ++first;
    }
    // retreat last past elements already in the second group
    while (true) {
      if (first == --last) return first;
      if (pred(*last)) break;
    }
    std::iter_swap(first, last);
    ++first;
  }
}

template <typename BidirIt, typename Less = std::less<void>>
void sort(BidirIt first, BidirIt last, Less&& less = Less{}) {
  using value_type = typename std::iterator_traits<BidirIt>::value_type;
  using pair = std::pair<BidirIt, BidirIt>;
  std::stack<pair> stk;  // explicit stack of subranges instead of recursion
  stk.emplace(first, last);
  while (!stk.empty()) {
    std::tie(first, last) = stk.top();
    stk.pop();
    if (first == last) continue;
    auto prev_last = std::prev(last);
    auto pivot = *prev_last;  // use the last element as the pivot
    auto mid = ::partition(first, prev_last,
      [&](const value_type& val) {
        return less(val, pivot);  // partition by "less than pivot"
      });
    std::iter_swap(mid, prev_last);  // move the pivot to its final position
    stk.emplace(first, mid);         // elements less than the pivot
    stk.emplace(++mid, last);        // elements not less than the pivot
  }
}
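
For reference, a minimal driver (my addition, not part of the question) that exercises the above on std::list, a container offering only bidirectional iterators:

#include <iostream>
#include <list>

int main() {
  std::list<int> data{5, 1, 4, 2, 3};
  ::sort(data.begin(), data.end());          // the bidirectional sort above
  for (int x : data) std::cout << x << ' ';  // prints: 1 2 3 4 5
}

The qualified call ::sort avoids ambiguity with std::sort, which argument-dependent lookup could otherwise find for iterators living in namespace std.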
asked Mar 13 '23 by Lingxi


1 Answer

There are a few reasons why practical library sort functions need random access iterators.

The most obvious one is the well-known fact that choosing an endpoint of the partition as the pivot reduces quicksort to O(n²) if the data is sorted (or "mostly sorted"), so most real-life quicksorts actually use a more robust pivot-selection algorithm. I think the most common one is the Wirth algorithm: choose the median of the first, middle, and last elements of the partition, which is robust against sorted vectors. (As Dietmar Kühl points out, just selecting the middle element would work almost as well, but there is practically no extra cost for the median-of-three algorithm.) Selecting a random element would also be a good strategy, since it is harder to game, but the requirement for a PRNG might be discouraging. Any strategy for selecting the pivot other than taking an endpoint requires random-access iterators (or a linear scan).
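
For illustration, a sketch of median-of-three selection (my own, not from the answer); the midpoint computation first + (last - first) / 2 is O(1) only for random-access iterators, whereas bidirectional iterators would need a linear scan to reach the middle:

#include <algorithm>  // std::iter_swap

// Sort the three sampled elements in place; the median ends up at mid.
template <typename RandomIt>
RandomIt median_of_three(RandomIt first, RandomIt last) {
  RandomIt mid  = first + (last - first) / 2;  // requires random access
  RandomIt back = last - 1;                    // likewise
  if (*mid  < *first) std::iter_swap(mid, first);
  if (*back < *first) std::iter_swap(back, first);
  if (*back < *mid)   std::iter_swap(back, mid);
  return mid;
}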

Second, quicksort is sub-optimal when the partition is small (for some heuristic definition of small). When there are few enough elements, the simplified loop of an insertion sort combined with locality of reference will make that a better solution. (This doesn't affect the complexity of the overall algorithm because the threshold is a fixed size; insertion sort of a maximum of k elements is O(1) for any previously established k. I think you'll typically find values between 10 and 30.) The insertion sort can be done with bidirectional iterators, but figuring out if the partition is smaller than the threshold cannot (again, unless you use an unnecessarily slow loop).
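
A sketch of the size check (again my illustration; kThreshold is an arbitrary stand-in for whatever cutoff a library picks):

#include <cstddef>  // std::ptrdiff_t

constexpr std::ptrdiff_t kThreshold = 16;

// O(1) with random-access iterators: a single subtraction.
template <typename RandomIt>
bool small_enough(RandomIt first, RandomIt last) {
  return last - first < kThreshold;
}

// Bidirectional alternative: a bounded walk, O(k) per call rather than
// O(1), i.e. the "unnecessarily slow loop" mentioned above.
template <typename BidirIt>
bool small_enough_bidir(BidirIt first, BidirIt last) {
  for (std::ptrdiff_t n = 0; n < kThreshold; ++n, ++first)
    if (first == last) return true;  // fewer than kThreshold elements
  return false;
}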

Third and possibly most importantly, quicksort can degenerate into O(n²) no matter how hard you try. Earlier C++ standards accepted that std::sort might be "O(n log n) on the average", but since the acceptance of DR713 the standard requires std::sort to be O(n log n) without qualification. This cannot be achieved with a pure quicksort, so modern library sort algorithms are actually based on introsort or something similar. Introsort falls back to a different sorting algorithm (normally heapsort) if it detects that the partitioning is too biased. The fallback algorithm is very likely to require random-access iterators (heapsort and shellsort both do, for example).
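
A minimal introsort skeleton (a sketch under my own assumptions, reusing the median_of_three helper sketched earlier; it is not any particular library's implementation):

#include <algorithm>  // std::partition, std::make_heap, std::sort_heap

// Once the depth budget is exhausted, fall back to heapsort, which is
// O(n log n) in the worst case (and itself needs random access).
template <typename RandomIt>
void introsort_impl(RandomIt first, RandomIt last, int depth_limit) {
  if (last - first < 2) return;
  if (depth_limit == 0) {
    std::make_heap(first, last);
    std::sort_heap(first, last);
    return;
  }
  auto pivot = *median_of_three(first, last);
  RandomIt mid = std::partition(first, last,
      [&](const auto& v) { return v < pivot; });
  introsort_impl(first, mid, depth_limit - 1);
  introsort_impl(mid, last, depth_limit - 1);
}

template <typename RandomIt>
void introsort(RandomIt first, RandomIt last) {
  int depth = 0;
  for (auto n = last - first; n > 1; n /= 2) ++depth;  // roughly log2(n)
  introsort_impl(first, last, 2 * depth);  // 2*log2(n) is a common budget
}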

Finally, recursion depth can be capped at log₂n by the simple strategy of recursing on the smaller partition and tail-recurring (explicitly looping) on the larger one. Since recursion is generally faster than explicitly maintaining a stack, and recursion is entirely reasonable when the maximum depth is in the low two digits, this little optimization is worthwhile (although not all library implementations use it). Again, that requires being able to compute the sizes of the partitions.
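
Sketched out (my illustration again), pivoting on the last element as the question's code does; note that the size comparison is exactly the operation bidirectional iterators cannot do in O(1):

#include <algorithm>  // std::partition, std::iter_swap

template <typename RandomIt>
void quicksort_bounded(RandomIt first, RandomIt last) {
  while (last - first > 1) {
    RandomIt prev_last = last - 1;
    auto pivot = *prev_last;  // last element as pivot, as in the question
    RandomIt mid = std::partition(first, prev_last,
        [&](const auto& v) { return v < pivot; });
    std::iter_swap(mid, prev_last);  // pivot lands in its final position
    // Comparing partition sizes requires random-access iterators.
    if (mid - first < last - mid) {
      quicksort_bounded(first, mid);  // recurse on the smaller side
      first = mid + 1;                // loop on the larger side
    } else {
      quicksort_bounded(mid + 1, last);
      last = mid;
    }
  }
}

Because the recursive call always takes the smaller side, the call stack never exceeds about log₂n frames, while the loop handles the rest.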

There are probably other aspects of practical sorting which require random-access iterators; those are just the ones off the top of my head.

answered Mar 19 '23 by rici