 

A good sorting algorithm for mostly-sorted data that doesn't all fit into memory? [closed]

In case you are given:

  • a certain amount of data
  • memory half the size of the data
  • part of the data is already sorted
  • you do not know the size of the sorted part.

Which sorting algorithm would you choose? I am debating between insertion sort and quicksort. I know that the best case for insertion sort is O(n), but the worst case is O(n²). Also, since memory is limited, I would divide the data in two parts, run quicksort on each, and then merge everything together. It would take O(n) time to split the data, O(n) to merge it, and O(n log n) to sort it with quicksort, for a net runtime of O(n log n).

Does anyone have any suggestions on how to improve this?

asked Feb 29 '12 by FranXh

People also ask

Which sort algorithm works best on mostly sorted data?

Insertion sort is the clear winner on this initial condition. Bubble sort is fast, but insertion sort has lower overhead. Shell sort is fast because it is based on insertion sort. Merge sort, heap sort, and quick sort do not adapt to nearly sorted data.

Which sorting algorithm is best suitable for low memory system?

Some basic algorithms like insertion sort or bubble sort require no additional memory and can sort the data in place. On the other hand, more efficient algorithms like quicksort and merge sort require O(log N) and O(N) extra space, respectively, to complete the sorting.

Which sorting algorithm is best when array is already sorted?

Merge sort. Merge sort is a divide-and-conquer algorithm. In each iteration, it divides the input array into two equal subarrays, calls itself recursively on the two subarrays, and finally merges the two sorted halves.
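For reference, a minimal merge sort sketch in Python (an illustration, not part of the original answer):

    def merge_sort(a):
        """Recursively split the array in half, sort each half, then merge."""
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
        # Merge the two sorted halves in linear time.
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]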

Which sort is better when sorting a list that is already sorted?

If the input array is already sorted, insertion sort performs as few as n-1 comparisons, thus making insertion sort more efficient when given sorted or "nearly-sorted" arrays.
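To make that concrete, here is a small sketch that counts comparisons; on already-sorted input it performs exactly n-1 of them (my own illustration, not part of the original answer):

    def insertion_sort(a):
        """Sort a in place and return the number of element comparisons."""
        comparisons = 0
        for i in range(1, len(a)):
            key, j = a[i], i - 1
            # On sorted input the loop below does one comparison
            # per element and never shifts anything.
            while j >= 0:
                comparisons += 1
                if a[j] <= key:
                    break
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key
        return comparisons

    print(insertion_sort(list(range(10))))  # prints 9, i.e. n - 1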


2 Answers

Your mergesort-like approach seems very reasonable. More generally, this type of sorting algorithm is called an external sorting algorithm. These algorithms often work as you've described - load some subset of the data into memory, sort it, then write it back out to disk. At the end, use a merging algorithm to merge everything back together. The choice of how much to load in and what sorting algorithm to use are usually the dominant concerns. I'll focus mostly on the sorting algorithm choice.
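To illustrate (this is my own sketch, not the asker's code), here is an external merge sort in Python: sort fixed-size chunks in memory, spill them to temporary files, then k-way merge them with heapq.merge. The chunk size and the one-integer-per-line file format are assumptions for the example.

    import heapq
    import os
    import tempfile

    def external_sort(input_path, output_path, max_lines_in_memory=100_000):
        """Sort a file of one integer per line using a limited memory budget."""
        chunk_paths = []
        with open(input_path) as src:
            while True:
                # Read at most max_lines_in_memory values -- the memory budget.
                chunk = [int(line) for _, line in zip(range(max_lines_in_memory), src)]
                if not chunk:
                    break
                chunk.sort()  # any fast in-memory sort works here
                fd, path = tempfile.mkstemp(text=True)
                with os.fdopen(fd, "w") as dst:
                    dst.writelines(f"{x}\n" for x in chunk)
                chunk_paths.append(path)
        # k-way merge of the sorted runs back into a single output file.
        readers = [open(p) for p in chunk_paths]
        try:
            streams = [(int(line) for line in r) for r in readers]
            with open(output_path, "w") as dst:
                dst.writelines(f"{x}\n" for x in heapq.merge(*streams))
        finally:
            for r in readers:
                r.close()
            for p in chunk_paths:
                os.remove(p)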

Your concerns about quicksort's worst-case behavior are, generally speaking, nothing to worry about: if you choose the pivot randomly, the probability of a really bad runtime is low. A random pivot also works well when the data is already sorted, since there are no fixed worst-case inputs (unless someone knows your random number generator and its seed). Alternatively, you could use a quicksort variant like introsort, which guarantees O(n log n) worst-case behavior by falling back to heapsort when the recursion gets too deep.
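A randomized-pivot quicksort looks roughly like this (a sketch for clarity; real implementations partition in place to avoid the extra lists):

    import random

    def quicksort(a):
        """Quicksort with a uniformly random pivot: no fixed input,
        sorted or otherwise, reliably triggers the O(n^2) case."""
        if len(a) <= 1:
            return a
        pivot = random.choice(a)
        less = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        greater = [x for x in a if x > pivot]
        return quicksort(less) + equal + quicksort(greater)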

That said, since you know that the data is already partially sorted, you may want to look into an adaptive sorting algorithm for your sorting step. You've mentioned insertion sort for this, but there are much better adaptive algorithms out there. If memory is scarce (as you've described), you might want to try looking into the smoothsort algorithm, which has best-case runtime O(n), worst-case runtime O(n log n), and uses only O(1) memory. It's not as adaptive as some other algorithms (like Python's timsort, natural mergesort, or Cartesian tree sort), but it has lower memory usage. It's also not as fast as a good quicksort, but if the data really is mostly sorted it can do pretty well.
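For a flavor of how adaptive sorts exploit existing order, here is a natural mergesort sketch (one of the adaptive algorithms mentioned above; smoothsort itself is considerably more involved). It detects the already-sorted runs in one pass, so fully sorted input is handled in O(n):

    from heapq import merge

    def natural_mergesort(a):
        """Split a into maximal ascending runs, then merge runs pairwise."""
        if not a:
            return []
        # Pass 1: find the boundaries of the already-sorted runs (O(n)).
        runs, start = [], 0
        for i in range(1, len(a)):
            if a[i] < a[i - 1]:
                runs.append(a[start:i])
                start = i
        runs.append(a[start:])
        # Pass 2: merge runs pairwise until one remains. Fully sorted
        # input yields a single run, so this loop is skipped entirely.
        while len(runs) > 1:
            runs = [list(merge(runs[i], runs[i + 1])) if i + 1 < len(runs)
                    else runs[i]
                    for i in range(0, len(runs), 2)]
        return runs[0]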

Hope this helps!

answered Nov 14 '22 by templatetypedef


On the face of it, I would divide and conquer with quicksort and call it a day. Many algorithm problems are over-thought.

Now, if you do have test data to work with and really want a grasp on this, stick an abstract class in the middle and benchmark. We can hem and haw all day, but since you know the data is already partially sorted, you'll have to test. Note that sorted data triggers worst-case performance in quicksort implementations that naively pick the first or last element as the pivot.

Consider that there are many sorting algorithms, and some are better suited to sorted sets. Also, when you know a set is sorted, you can merge it with another sorted set in O(n) time. Thus, identifying the chunks of sorted data first might save you a lot of time: you add a single O(n) pass and greatly reduce the chance of quicksort degrading to O(n²) time.
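As a sketch of that idea (my own illustration, assuming the sorted data happens to sit in a single prefix):

    import heapq

    def sort_with_sorted_prefix(a):
        """Find the already-sorted prefix in one O(n) scan, sort only
        the remainder, then merge the two pieces in linear time."""
        end = 1
        while end < len(a) and a[end - 1] <= a[end]:
            end += 1
        prefix, rest = a[:end], a[end:]
        rest.sort()  # quicksort/introsort only ever sees the unsorted part
        return list(heapq.merge(prefix, rest))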

answered Nov 15 '22 by Jeff Ferland