 

A good sorting algorithm for mostly-sorted data that doesn't all fit into memory? [closed]

In case you are given:

  • a certain amount of data
  • memory half the size of the data
  • part of the data is already sorted
  • you do not know the size of the sorted part.

Which sorting algorithm would you choose? I am debating between insertion sort and quicksort. I know that the best case for insertion sort is O(n), but the worst case is O(n²). Also, since memory is limited, I would divide the data in two parts, run quicksort on each, and then merge everything together. It would take O(n) time to split the data, O(n) to merge it, and O(n log n) to sort it with quicksort, for a net runtime of O(n log n).

Does anyone have any suggestions on how to improve this?

asked Feb 29 '12 by FranXh

People also ask

Which sort algorithm works best on mostly sorted data?

Insertion sort is the clear winner on this initial condition. Bubble sort is fast, but insertion sort has lower overhead. Shell sort is fast because it is based on insertion sort. Merge sort, heap sort, and quick sort do not adapt to nearly sorted data.

Which sorting algorithm is best suitable for low memory system?

Some basic algorithms like insertion sort or bubble sort require no additional memory and can sort the data in place. On the other hand, more efficient algorithms like quicksort and merge sort require O(log N) and O(N) extra space, respectively, to complete the sorting.

Which sorting algorithm is best when array is already sorted?

Merge sort. Merge sort is a divide-and-conquer algorithm. In each iteration, it divides the input array into two equal subarrays, calls itself recursively on the two subarrays, and finally merges the two sorted halves.
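For reference, a minimal merge sort sketch in Python (an illustration, not part of the original answer):

    def merge_sort(a):
        """Recursively split the array in half, sort each half, then merge."""
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
        # Merge the two sorted halves in linear time.
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]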

Which sort is better when sorting a list that is already sorted?

If the input array is already sorted, insertion sort performs as few as n-1 comparisons, thus making insertion sort more efficient when given sorted or "nearly-sorted" arrays.
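To make that concrete, here is a small sketch that counts comparisons; on already-sorted input it performs exactly n-1 of them (my own illustration, not part of the original answer):

    def insertion_sort(a):
        """Sort a in place and return the number of element comparisons."""
        comparisons = 0
        for i in range(1, len(a)):
            key, j = a[i], i - 1
            # On sorted input the loop below does one comparison
            # per element and never shifts anything.
            while j >= 0:
                comparisons += 1
                if a[j] <= key:
                    break
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key
        return comparisons

    print(insertion_sort(list(range(10))))  # prints 9, i.e. n - 1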


2 Answers

Your mergesort-like approach seems very reasonable. More generally, this type of sorting algorithm is called an external sorting algorithm. These algorithms often work as you've described - load some subset of the data into memory, sort it, then write it back out to disk. At the end, use a merging algorithm to merge everything back together. The choice of how much to load in and what sorting algorithm to use are usually the dominant concerns. I'll focus mostly on the sorting algorithm choice.
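To illustrate (this is my own sketch, not the asker's code), here is an external merge sort in Python: sort fixed-size chunks in memory, spill them to temporary files, then k-way merge them with heapq.merge. The chunk size and the one-integer-per-line file format are assumptions for the example.

    import heapq
    import os
    import tempfile

    def external_sort(input_path, output_path, max_lines_in_memory=100_000):
        """Sort a file of one integer per line using a limited memory budget."""
        chunk_paths = []
        with open(input_path) as src:
            while True:
                # Read at most max_lines_in_memory values -- the memory budget.
                chunk = [int(line) for _, line in zip(range(max_lines_in_memory), src)]
                if not chunk:
                    break
                chunk.sort()  # any fast in-memory sort works here
                fd, path = tempfile.mkstemp(text=True)
                with os.fdopen(fd, "w") as dst:
                    dst.writelines(f"{x}\n" for x in chunk)
                chunk_paths.append(path)
        # k-way merge of the sorted runs back into a single output file.
        readers = [open(p) for p in chunk_paths]
        try:
            streams = [(int(line) for line in r) for r in readers]
            with open(output_path, "w") as dst:
                dst.writelines(f"{x}\n" for x in heapq.merge(*streams))
        finally:
            for r in readers:
                r.close()
            for p in chunk_paths:
                os.remove(p)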

Your concerns about quicksort's worst-case behavior are, generally speaking, nothing to worry about: if you choose the pivot randomly, the probability of a really bad runtime is low. A random pivot also works well when the data is already sorted, since there are no fixed worst-case inputs (unless someone knows your random number generator and its seed). Alternatively, you could use a quicksort variant like introsort, which guarantees O(n log n) worst-case behavior by falling back to heapsort when the recursion gets too deep.
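A randomized-pivot quicksort looks roughly like this (a sketch for clarity; real implementations partition in place to avoid the extra lists):

    import random

    def quicksort(a):
        """Quicksort with a uniformly random pivot: no fixed input,
        sorted or otherwise, reliably triggers the O(n^2) case."""
        if len(a) <= 1:
            return a
        pivot = random.choice(a)
        less = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        greater = [x for x in a if x > pivot]
        return quicksort(less) + equal + quicksort(greater)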

That said, since you know that the data is already partially sorted, you may want to look into an adaptive sorting algorithm for your sorting step. You've mentioned insertion sort for this, but there are much better adaptive algorithms out there. If memory is scarce (as you've described), you might want to try looking into the smoothsort algorithm, which has best-case runtime O(n), worst-case runtime O(n log n), and uses only O(1) memory. It's not as adaptive as some other algorithms (like Python's timsort, natural mergesort, or Cartesian tree sort), but it has lower memory usage. It's also not as fast as a good quicksort, but if the data really is mostly sorted it can do pretty well.
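For a flavor of how adaptive sorts exploit existing order, here is a natural mergesort sketch (one of the adaptive algorithms mentioned above; smoothsort itself is considerably more involved). It detects the already-sorted runs in one pass, so fully sorted input is handled in O(n):

    from heapq import merge

    def natural_mergesort(a):
        """Split a into maximal ascending runs, then merge runs pairwise."""
        if not a:
            return []
        # Pass 1: find the boundaries of the already-sorted runs (O(n)).
        runs, start = [], 0
        for i in range(1, len(a)):
            if a[i] < a[i - 1]:
                runs.append(a[start:i])
                start = i
        runs.append(a[start:])
        # Pass 2: merge runs pairwise until one remains. Fully sorted
        # input yields a single run, so this loop is skipped entirely.
        while len(runs) > 1:
            runs = [list(merge(runs[i], runs[i + 1])) if i + 1 < len(runs)
                    else runs[i]
                    for i in range(0, len(runs), 2)]
        return runs[0]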

Hope this helps!

answered Nov 14 '22 by templatetypedef


On the face of it, I would divide and conquer with quicksort and call it a day. Many algorithm problems are over-thought.

Now, if you do have test data to work with and really want a grasp on this, stick an abstract class in the middle and benchmark. We can hem and haw all day, but since you know the data is already partially sorted, you'll have to test. Note that sorted data triggers worst-case performance in quicksort implementations that naively pick the first or last element as the pivot.

Consider that there are many sorting algorithms, and some are better suited to sorted sets. Also, when you know a set is sorted, you can merge it with another sorted set in O(n) time. Thus, identifying the chunks of sorted data first might save you a lot of time: you add a single O(n) pass and greatly reduce the chance of quicksort degrading to O(n²) time.
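As a sketch of that idea (my own illustration, assuming the sorted data happens to sit in a single prefix):

    import heapq

    def sort_with_sorted_prefix(a):
        """Find the already-sorted prefix in one O(n) scan, sort only
        the remainder, then merge the two pieces in linear time."""
        end = 1
        while end < len(a) and a[end - 1] <= a[end]:
            end += 1
        prefix, rest = a[:end], a[end:]
        rest.sort()  # quicksort/introsort only ever sees the unsorted part
        return list(heapq.merge(prefix, rest))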

answered Nov 15 '22 by Jeff Ferland