 

Which sorting algorithm works best on very large data sets? [closed]

I was searching the Internet to find which sorting algorithm is best suited to a very large data set. I found that many people hold the opinion that merge sort is best because it is fair and it guarantees O(n log n) time complexity, while quicksort is not safe. Variations of quicksort can also be unsafe, because the real data set can be anything.

If swapping two elements has negligible time cost, then why can't we choose heapsort as the best sorting algorithm in this case, since it is in-place as well as O(n log n)?

Merge sort requires an extra O(n) space; if the data set is very large, then we can't use this algorithm.

Please tell me: which algorithm should be the best in this scenario?

asked Aug 26 '15 by Ankit Kumar Namdeo




2 Answers

There's no one algorithm that's clearly the "best" algorithm. If there were, we'd be using it everywhere! Instead, it depends on a bunch of factors.

For starters, can you fit your data into main memory? If you can't, then you'd need to rely on an external sorting algorithm. These algorithms are often based on quicksort and mergesort.
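
As a rough illustration of the external-sorting idea (a sketch, not the answerer's code), here is a minimal external merge sort in Python: sort chunks of lines that fit in memory, spill each sorted run to a temporary file, then k-way merge the runs. The chunk size is an arbitrary placeholder, and the sketch assumes every line in the input ends with a newline.

    import heapq
    import itertools
    import tempfile

    def external_sort(src_path, dst_path, chunk_size=100_000):
        """Sort the lines of a text file too large to fit in memory."""
        run_files = []
        with open(src_path) as src:
            while True:
                # Read one memory-sized chunk of lines at a time.
                chunk = list(itertools.islice(src, chunk_size))
                if not chunk:
                    break
                chunk.sort()  # in-memory sort of a single run
                run = tempfile.TemporaryFile("w+")
                run.writelines(chunk)
                run.seek(0)
                run_files.append(run)
        with open(dst_path, "w") as dst:
            # heapq.merge is lazy, so the k-way merge holds only one
            # line per run in memory at any moment.
            dst.writelines(heapq.merge(*run_files))
        for run in run_files:
            run.close()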

Second, do you know anything about your input distribution? If it's mostly sorted, then something like Timsort might be a great option, since it's designed to work well on sorted data. If it's mostly random, Timsort is probably not a good choice.
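
CPython's built-in sort is in fact Timsort, so it makes an easy demonstration; the timings below are indicative only and will vary from machine to machine.

    import random
    import timeit

    in_order = list(range(1_000_000))   # already sorted input
    shuffled = in_order[:]
    random.shuffle(shuffled)

    # Timsort detects the single existing run and finishes in near-O(n).
    print(timeit.timeit(lambda: sorted(in_order), number=5))
    # On random input it does ordinary O(n log n) merging of small runs.
    print(timeit.timeit(lambda: sorted(shuffled), number=5))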

Third, what kind of elements are you sorting? If you are sorting generic objects, then you're pretty much locked into comparison sorting. If not, perhaps you could use a non-comparison sort like counting sort or radix sort.
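
For example, here is a minimal counting sort, assuming the keys are small non-negative integers; it runs in O(n + k) time for n items and key range k, with no comparisons at all.

    def counting_sort(nums, max_key):
        """Sort integers in the range [0, max_key]."""
        counts = [0] * (max_key + 1)
        for n in nums:
            counts[n] += 1                    # tally each key
        out = []
        for value, count in enumerate(counts):
            out.extend([value] * count)       # emit keys in order
        return out

    print(counting_sort([4, 1, 3, 1, 0], max_key=4))  # [0, 1, 1, 3, 4]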

Fourth, how many cores do you have? Some sorting algorithms (quicksort, mergesort, MSD radix sort) parallelize really well, while others do not (heapsort).
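
A crude way to see the mergesort side of this (a sketch under simplifying assumptions, not a tuned implementation): sort chunks in worker processes, then do a sequential k-way merge of the sorted runs. The worker count here is a placeholder.

    import heapq
    import random
    from concurrent.futures import ProcessPoolExecutor

    def parallel_sort(data, workers=4):
        """Sort chunks in parallel processes, then merge the runs."""
        step = (len(data) + workers - 1) // workers
        chunks = [data[i:i + step] for i in range(0, len(data), step)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            runs = pool.map(sorted, chunks)   # each chunk sorted in a worker
        return list(heapq.merge(*runs))       # sequential merge of sorted runs

    if __name__ == "__main__":
        nums = [random.random() for _ in range(1_000_000)]
        assert parallel_sort(nums) == sorted(nums)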

Fifth, how are your data represented? If they're stored in an array, quicksort or a quicksort variant will likely do well because of locality of reference, while mergesort might be slow due to the extra memory needed. If they're in a linked list, though, the locality of reference from quicksort goes away and mergesort suddenly becomes competitive again.

The best option is probably to take a lot of different factors into account and then make a decision from there. One of the reasons it's so fun to design and study algorithms is that there's rarely one single best choice; often, the best option depends a ton on your particular situation and changes based on what you're seeing.

(You mentioned a few details about quicksort, heapsort, and mergesort that I wanted to touch on before wrapping up this answer. While you're right that quicksort has a degenerate O(n²) worst case, there are many ways to avoid this. The introsort algorithm keeps track of the recursion depth and switches the algorithm to heapsort if it looks like the quicksort will degenerate; there's a sketch of this after the aside. This guarantees O(n log n) worst-case behavior with low memory overhead and maximizes the amount of benefit you get from quicksort. Randomized quicksort, while still having an O(n²) worst case, has a vanishingly small probability of actually hitting that worst case.

Heapsort is a good algorithm in practice, but isn't as fast as the other algorithms in some cases because it doesn't have good locality of reference. That said, the fact that it never degenerates and needs only O(1) auxiliary space is a huge selling point.

Mergesort does need a lot of auxiliary memory, which is one reason why you might not want to use it if you have a huge amount of data to sort. It's worth knowing about, though, since its variants are widely used.)
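
To make the introsort idea from the aside concrete, here is a hedged sketch: randomized quicksort with a depth budget of roughly 2·log₂(n), falling back to a heap when the budget runs out. Real introsorts (such as the one inside C++ std::sort) work in place and add an insertion-sort pass for tiny ranges; this version trades that for readability.

    import heapq
    import math
    import random

    def introsort(nums):
        def sort(part, depth):
            if len(part) <= 1:
                return part
            if depth == 0:
                # Depth budget exhausted: finish with a heap to
                # guarantee O(n log n) instead of degenerating.
                heapq.heapify(part)
                return [heapq.heappop(part) for _ in range(len(part))]
            pivot = random.choice(part)       # randomized pivot choice
            less = [x for x in part if x < pivot]
            equal = [x for x in part if x == pivot]
            more = [x for x in part if x > pivot]
            return sort(less, depth - 1) + equal + sort(more, depth - 1)

        budget = 2 * max(1, int(math.log2(len(nums) or 1)))
        return sort(list(nums), budget)

    assert introsort([5, 3, 8, 1, 9, 2]) == [1, 2, 3, 5, 8, 9]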

answered Nov 02 '22 by templatetypedef


Your question is too open-ended to be answered specifically. There are a number of efficient sorting algorithms and each has its own strengths and weaknesses. If you know your data, it is possible that an optimal efficiency algorithm (heap, quick, merge, etc) is not the right tool for the job.

For example, in a recent product, we were required to keep the bookmarks in a Word document sorted by their order of appearance. The bookmarks could become unsorted due to editing of the document (copy, cut, paste), so after each of those operations it was important to re-sort the list. In this case, bubblesort was the right answer even though it has a higher big-O complexity than any number of other algorithms. The fact that the sort is efficient when the list is nearly sorted (which is usually the case in this circumstance) and that it's an in-place operation meant that it was the right tool for the job.
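
As a hedged reconstruction of that choice (the product code itself isn't shown), an early-exit bubble sort is only a few lines: a single pass with no swaps proves the list is sorted, so a nearly-sorted bookmark list costs only a pass or two, i.e. close to O(n).

    def bubble_sort(items):
        """In-place bubble sort that stops once a pass makes no swaps."""
        for end in range(len(items) - 1, 0, -1):
            swapped = False
            for i in range(end):
                if items[i] > items[i + 1]:
                    items[i], items[i + 1] = items[i + 1], items[i]
                    swapped = True
            if not swapped:       # no swaps: already sorted, stop early
                break
        return items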

Take a hard look at your data and read up on the various strengths and weaknesses of the well-known sorting algorithms and you'll be well on your way to answering your own question.

answered Nov 02 '22 by P. Hinker