
Sort in ascending or descending order (chosen arbitrarily; prefer whichever is cheaper)

I have an array of elements. This array could be:

  • Randomly shuffled (about 20% of the time)
  • Nearly sorted* in ascending order (about 40% of the time)
  • Nearly sorted in descending order (about 40% of the time)

But I do not know (in advance) which of these cases applies. I would prefer to sort the array into the order which it is already close to.

It does not matter whether the output is ascending or descending, but it must be one or the other (so I can perform a binary search on it.)

The sort need not be stable.


Some background info: The process goes roughly like this:

  • Populate the array
  • Sort on some attribute A
  • Do some processing (compute quantiles, and some other minor stuff)
  • Sort on some other attribute B
  • Do more processing
  • Sort on attribute C
  • Do more processing

A and B are often correlated with each other (either positively or negatively). The same applies to B and C. Occasionally A == C.

* "Nearly sorted" here means most elements are close to their final positions, but rarely exactly at them (there is a lot of additive noise, and not many long sorted subsequences). Still, there are usually a few "outliers" at the start and end of the array which are poor predictors of the order for the next sort.


Is there an algorithm that can take advantage of the fact that I have no preference for ascending vs. descending, to sort more cheaply than the TimSort I am currently using?

finnw asked Nov 03 '12 23:11




2 Answers

I'd continue using Timsort (although Smoothsort* is a good alternative), but first probe the array to decide whether to sort in ascending or descending order. Look at the first and last elements and sort accordingly. If the array is unsorted, the choice is immaterial; if it is (partially) sorted, probing at a wide interval is more likely to detect the direction correctly.

*Smoothsort has the same best, average, and worst case time as Timsort, and better space complexity. Like Timsort, it was specifically designed to take advantage of partially sorted data.
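
A minimal sketch of the probe idea in Python, where the built-in `list.sort` stands in for Timsort. The function name `sort_in_nearest_order` and the `margin` parameter are made up for illustration; `margin` is my own addition for skipping the outliers the question mentions at both ends, and is not part of the suggestion above.

```python
def sort_in_nearest_order(items, key=lambda x: x, margin=0):
    """Sort items in place, ascending or descending, based on two widely spaced probes.

    Returns True if the result is in descending order, False if ascending.
    `margin` (hypothetical) skips that many elements at each end before probing.
    """
    n = len(items)
    if n < 2:
        return False
    # Clamp the margin so the two probe positions never cross.
    lo = min(margin, (n - 1) // 2)
    hi = n - 1 - lo
    # If the early probe is larger than the late one, guess "descending".
    descending = key(items[lo]) > key(items[hi])
    items.sort(key=key, reverse=descending)
    return descending


data = [9, 8, 7, 5, 6, 4, 3, 1, 2]
is_descending = sort_in_nearest_order(data)
# The caller needs is_descending later to binary-search in the right direction.
```

With random input the guess is a coin flip, which is harmless because either direction costs about the same there.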

Ted Hopp answered Oct 29 '22 17:10


Another possibility to consider:

  • Start doing a (hand-rolled) insertion sort
  • As you go, count the number of inversions you perform
  • After you have done some small fixed number of insertions, compare the number of inversions you have counted with the maximum number that would have occurred by that point if the data had been reverse-sorted to begin with:
      • If the proportion is close to 0, then (probably) the data is nearly sorted. Complete the insertion sort, which performs very well on nearly-sorted data. If you don't like the sound of "probably", then continue counting inversions as you go and be ready to fall back to Timsort if the data stops looking nearly sorted (that is, if the proportion rises past a threshold).
      • If the proportion is close to 1, then (probably) the data is nearly reverse-sorted, and you have a small number of sorted elements at the start. Move them to the end, reverse them, and complete an insertion sort with a reversed comparator.
      • Otherwise the data is random, so use your favourite sorting algorithm. I'd say Timsort, but since that does well on nearly-sorted data there must be some other algorithm that does at least a tiny bit better than Timsort does on uniformly-shuffled data. Probably plain merge sort without the Tim.
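
The steps above could be sketched roughly as follows in Python. Everything named here is an assumption for illustration: the function names, the probe length of 16, and the 0.2 / 0.8 cut-offs for "close to 0" and "close to 1". For the reverse-sorted case this sketch simply reverses the sorted prefix in place and carries on in descending order, which is a simplification of the "move them to the end" step, and it leaves out the optional mid-run fallback.

```python
def insertion_sort_range(a, start, end, descending=False):
    """Insert a[start:end] into the already-ordered prefix a[:start].

    Returns the number of element shifts, which equals the number of
    inversions removed (relative to the chosen direction).
    """
    shifts = 0
    for i in range(start, end):
        x = a[i]
        j = i
        while j > 0 and ((a[j - 1] < x) if descending else (a[j - 1] > x)):
            a[j] = a[j - 1]
            j -= 1
            shifts += 1
        a[j] = x
    return shifts


def adaptive_sort(a, probe=16, low=0.2, high=0.8):
    """Sort a in place; return True if the result is descending."""
    n = len(a)
    k = min(probe, n)
    if k < 2:
        return False
    # Trial run: ascending insertion sort on the first k elements.
    inversions = insertion_sort_range(a, 1, k)
    ratio = inversions / (k * (k - 1) // 2)  # 1.0 means the prefix was fully reversed
    if ratio <= low:
        # Probably nearly ascending: finish with insertion sort.
        insertion_sort_range(a, k, n)
        return False
    if ratio >= high:
        # Probably nearly descending: flip the sorted prefix, continue descending.
        a[:k] = a[:k][::-1]
        insertion_sort_range(a, k, n, descending=True)
        return True
    # Looks random: hand over to the library sort.
    a.sort()
    return False
```

If the guess is wrong, the remaining insertion sort degrades to quadratic time, which is the reason for the fallback mentioned in the first case.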

The "small fixed number" can be a number for which insertion sort is fairly fast even in bad cases. I would guess 10-20 or so. It's possible to work out the probability of a false positive in uniformly shuffled data for any given number of insertions and any given threshold of "close to 0/1", but I'm too lazy.
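
For anyone less lazy, here is one way the false-positive probability could be worked out, assuming distinct elements and a uniformly random order. It counts the permutations of k items by their number of inversions with a small dynamic program; the function names are made up for illustration.

```python
from math import factorial


def inversion_counts(k):
    """Return counts where counts[i] is the number of permutations of
    k distinct items that have exactly i inversions."""
    counts = [1]  # a single item: one permutation, zero inversions
    for m in range(2, k + 1):
        # Inserting the m-th item can add anywhere from 0 to m-1 inversions.
        new = [0] * (len(counts) + m - 1)
        for i, c in enumerate(counts):
            for j in range(m):
                new[i + j] += c
        counts = new
    return counts


def prob_at_most(k, t):
    """Probability that a uniformly shuffled k-element prefix has at most t inversions."""
    counts = inversion_counts(k)
    return sum(counts[: t + 1]) / factorial(k)


# Chance that shuffled data passes a "close to 0" test of at most
# 24 inversions (out of a possible 120) after 16 insertions.
p = prob_at_most(16, 24)
```

The distribution is symmetric, so the same figure applies to the "close to 1" test with at least 120 - 24 inversions.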

You say the first and last few array elements typically buck the trend, in which case you could exclude them from the initial test insertion sort.

Obviously this approach is somewhat inspired by Timsort. But Timsort is fiendishly optimized for data that contains runs -- I have tried to fiendishly optimize only for data that's close to one big run (in either direction). Another feature of Timsort is that it's well tested; I don't claim to share that.

Steve Jessop answered Oct 29 '22 17:10