Sort in ascending or descending order (chosen arbitrarily; Prefer whichever is cheaper)

Tags: algorithm, sorting

I have an array of elements. This array could be:

- Randomly shuffled (about 20% of the time)
- Nearly sorted* in ascending order (about 40% of the time)
- Nearly sorted in descending order (about 40% of the time)

But I do not know (in advance) which of these cases applies. I would prefer to sort the array into the order which it is already close to.

It does not matter whether the output is ascending or descending, but it must be one or the other (so I can perform a binary search on it.)

The sort need not be stable.

Some background info: The process goes roughly like this:

- Populate the array
- Sort on some attribute A
- Do some processing (compute quantiles, and some other minor stuff)
- Sort on some other attribute B
- Do more processing
- Sort on attribute C
- Do more processing

A and B are often correlated with each other (but may be positively or negatively.) Same applies to B and C. Occasionally A == C.

* "nearly sorted" here means most elements are close to their final positions. But rarely exactly at their final positions (there is a lot of additive noise, and not many long sorted subsequences.) Still, there are usually a few "outliers" at the start and end of the array which are poor predictors of the order for the next sort.

Is there an algorithm that can take advantage of the fact that I have no preference for ascending vs. descending, to sort more cheaply (compared to the TimSort I am currently using)?

asked Nov 03 '12 23:11

finnw

2 Answers

I'd continue using Timsort (however, a good alternative is Smoothsort^*), but first probe the array to decide whether to sort in ascending or descending order. Look at the first and last elements and sort accordingly. If the array is unsorted, the choice is immaterial; if it is (partially) sorted, probing at a wide interval is more likely to correctly detect which way.

^*Smoothsort has the same best, average, and worst case time as Timsort, and better space complexity. Like Timsort, it was specifically designed to take advantage of partially sorted data.
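
A minimal sketch of the probing idea, in Python (whose built-in `list.sort` is a Timsort variant in CPython). The names `guess_descending` and `sort_either_way`, and the `margin` and `pairs` parameters, are hypothetical and not part of the answer; `margin` is an assumed knob for skipping the outliers the question mentions at both ends.

```python
def guess_descending(items, key=lambda x: x, margin=0, pairs=1):
    """Return True if widely spaced probes suggest descending order.

    margin: elements to ignore at each end (assumed knob for outliers).
    pairs:  number of (left, right) probe pairs that vote on the direction.
    """
    lo, hi = margin, len(items) - 1 - margin
    votes = 0
    for i in range(pairs):
        left, right = lo + i, hi - i
        if left >= right:
            break  # ran out of distinct positions to compare
        a, b = key(items[left]), key(items[right])
        if a > b:
            votes += 1   # left element larger: looks descending
        elif a < b:
            votes -= 1   # left element smaller: looks ascending
    return votes > 0     # ties default to ascending


def sort_either_way(items, key=lambda x: x, **probe_options):
    """Sort in place in the guessed direction; return True if descending."""
    descending = guess_descending(items, key, **probe_options)
    items.sort(key=key, reverse=descending)
    return descending
```

With `margin=0` and `pairs=1` this compares only the first and last elements, as described above; larger values trade a few extra comparisons for a more robust guess.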

answered Oct 29 '22 17:10

Ted Hopp

Another possibility to consider:

- Start doing a (hand-rolled) insertion sort.
- As you go, count the number of inversions you perform.
- After you have done some small fixed number of insertions, compare the number of inversions that you have counted to the maximum number of inversions that would have occurred by that point if the data were reverse-sorted to begin with:
  - If the proportion is close to 0, then (probably) the data is nearly-sorted. Complete the insertion sort, which performs very well on nearly-sorted data. If you don't like the sound of "probably" then continue counting inversions as you go and be ready to fall back to Timsort if the proportion rises above a threshold.
  - If the proportion is close to 1, then (probably) the data is nearly-reverse-sorted, and you have a small number of sorted elements at the start. Move them to the end, reverse them, and complete an insertion sort with reversed comparator.
  - Otherwise the data is random, so use your favourite sorting algorithm. I'd say Timsort, but since that does well on nearly-sorted data there must be some other algorithm that does at least a tiny bit better than Timsort does on uniformly-shuffled data. Probably plain merge sort without the Tim.
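
The steps above can be sketched roughly as follows. This is an untested-in-production illustration, not a tuned implementation: the function names and the `probe`, `low` and `high` parameters are assumptions, the thresholds 0.2 and 0.8 are arbitrary stand-ins for "close to 0" and "close to 1", and the random case simply calls Python's built-in sort. It also simplifies two things: in the descending case it reverses the sorted prefix in place instead of moving it to the end, and it does not keep counting inversions after the probe.

```python
def _insert_range(a, lo, start, stop, descending):
    """Extend the sorted run a[lo:start] to cover a[lo:stop].

    Returns the number of element shifts, which equals the number of
    inversions removed by these insertions.
    """
    shifts = 0
    for i in range(start, stop):
        x = a[i]
        j = i
        while j > lo and ((a[j - 1] < x) if descending else (a[j - 1] > x)):
            a[j] = a[j - 1]
            j -= 1
        a[j] = x
        shifts += i - j
    return shifts


def sort_either_direction(a, probe=16, low=0.2, high=0.8):
    """Sort list `a` in place; return True if the result is descending."""
    n = len(a)
    k = min(probe, n)
    if k < 2:
        return False
    # Probe: ascending insertion sort of the first k elements.
    inversions = _insert_range(a, 0, 1, k, descending=False)
    ratio = inversions / (k * (k - 1) / 2)  # 0 = sorted, 1 = reverse-sorted
    if ratio <= low:
        # Probably nearly ascending: finish the insertion sort.
        _insert_range(a, 0, k, n, descending=False)
        return False
    if ratio >= high:
        # Probably nearly descending: turn the sorted prefix into a
        # descending run, then continue with the comparison reversed.
        a[:k] = a[:k][::-1]
        _insert_range(a, 0, k, n, descending=True)
        return True
    a.sort()  # looks random: hand over to a general-purpose sort
    return False
```

The caller uses the returned flag to know which way to run the binary search afterwards.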

The "small fixed number" can be a number for which insertion sort is fairly fast even in bad cases. I would guess 10-20 or so. It's possible to work out the probability of a false positive in uniformly shuffled data for any given number of insertions and any given threshold of "close to 0/1", but I'm too lazy.
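
For the curious, that false-positive probability can be computed exactly. The sketch below counts, for `k` distinct elements in uniformly random order, how many of the `k!` permutations have an inversion count inside either "looks sorted" band. The function names and the default thresholds `low=0.2` and `high=0.8` are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import factorial


def inversion_counts(k):
    """counts[i] = number of permutations of k distinct items with i inversions."""
    counts = [1]  # a single item has one arrangement with 0 inversions
    for m in range(2, k + 1):
        # Inserting the m-th item creates between 0 and m - 1 new inversions.
        new = [0] * (len(counts) + m - 1)
        for i, c in enumerate(counts):
            for extra in range(m):
                new[i + extra] += c
        counts = new
    return counts


def false_positive_probability(k, low=0.2, high=0.8):
    """Chance that k shuffled items look nearly sorted in either direction."""
    counts = inversion_counts(k)
    max_inv = k * (k - 1) // 2
    hits = sum(c for i, c in enumerate(counts)
               if i <= low * max_inv or i >= high * max_inv)
    return Fraction(hits, factorial(k))
```

For example, `float(false_positive_probability(16))` gives the chance that 16 shuffled elements are misread as nearly sorted under these thresholds; tightening `low` and `high` or raising `k` lowers it.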

You say the first and last few array elements typically buck the trend, in which case you could exclude them from the initial test insertion sort.

Obviously this approach is somewhat inspired by Timsort. But Timsort is fiendishly optimized for data that contains runs -- I have tried to fiendishly optimize only for data that's close to one big run (in either direction). Another feature of Timsort is that it's well tested; I don't claim to share that.

answered Oct 29 '22 17:10

Steve Jessop
