I'm trying to work out how to efficiently sort a huge dataset that won't fit in memory. The obvious answer at a high level is to sort a whole bunch of chunks that do fit in memory using some standard algorithm, write these out to disk, and then merge them. Merging them is the problem.
Let's say the data divides up into C chunks, so I have C files to merge. If I do a C-way merge in a single pass, comparing the current head of all C files for each record I write, then technically I have an O(N*C) algorithm, which works out to O(N^2) since C grows with N, though one that only has to perform O(N) writes to disk. If I instead iteratively merge them into C/2 files, then C/4 files, etc., then I have an O(N log C) algorithm, but one that also has to perform O(N log C) writes to disk, so the repeated passes over the data dominate the cost.
What is the typical solution to this conundrum? Is there any good one?
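For concreteness, here is a rough Python sketch of the scheme I mean, with one record per line, records compared as strings, and heapq.merge standing in for the single-pass C-way merge; the real data and file handling are of course more involved:

    import heapq
    import os
    import tempfile

    def spill(sorted_chunk):
        # Write one sorted chunk to a temporary file, one record per line.
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "w") as f:
            f.writelines(rec + "\n" for rec in sorted_chunk)
        return path

    def external_sort(records, chunk_size, out_path):
        # Phase 1: sort chunks that fit in memory and spill each to disk.
        chunk, chunk_paths = [], []
        for rec in records:
            chunk.append(rec)
            if len(chunk) == chunk_size:
                chunk_paths.append(spill(sorted(chunk)))
                chunk = []
        if chunk:
            chunk_paths.append(spill(sorted(chunk)))

        # Phase 2: merge all C chunk files in a single pass.
        files = [open(p) for p in chunk_paths]
        streams = [(line.rstrip("\n") for line in f) for f in files]
        with open(out_path, "w") as out:
            for rec in heapq.merge(*streams):   # records compared as strings
                out.write(rec + "\n")
        for f, p in zip(files, chunk_paths):
            f.close()
            os.remove(p)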
It's funny, as I heard this very same question not a month ago... and I remember the response our local guru gave as well.
"Use the unix sort command"
Though we admittedly thought it was a joke at the expense of the asker... it turns out that it was not. The reasoning is that those smart guys already put a lot of thought into how to solve the problem of sorting very large files, and came up with a very impressive implementation that makes good use of the available resources.
Therefore, unless you plan on re-inventing the wheel (i.e. you have the time and this is business critical), simply using the unix sort is probably an excellent idea.
The only drawback is its arcane syntax. This page is dedicated to the command and various explanations.
My personal advice: take a small sample of the data to test that the command does exactly what you want.
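To give an idea of what an invocation might look like on a large file (the key, buffer size and temp directory below are purely illustrative assumptions; -S, -T and --parallel are GNU coreutils options):

    # Sort a big comma-separated file by its second, numeric column.
    #   -t,            field separator
    #   -k2,2n         sort key: column 2, numeric
    #   -S 4G          in-memory buffer before spilling sorted runs to disk
    #   -T /scratch    directory for the temporary run files
    #   --parallel=8   number of sorting threads
    sort -t, -k2,2n -S 4G -T /scratch --parallel=8 -o sorted.csv huge.csv

    # If you already have sorted chunk files, -m merges them without re-sorting.
    sort -m -t, -k2,2n -o sorted.csv chunk_*.csv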
The simple answer is that there is no simple answer to this question. There are lots of answers, most of them fairly complex -- Knuth volume 3 (for one example) devotes a great deal of space to it.
One thing that becomes obvious when looking through what's been done is that you really want to minimize the number of runs you create during your initial sorting, and maximize the length of each. To do that, you generally want to read in about as much data as you can fit in memory, but instead of just sorting it and writing it out, you want to put it into a heap. Then as you write each record out, you read IN another record.
You then check whether that record would sort before or after the record you just wrote out. If it would sort after it, you insert it into your heap and continue. If it would sort before it, it cannot be part of the current run without breaking the run's ordering, so you insert it into a second heap, held back for the next run.
You stop adding records to the current run when the first heap is completely empty, and your second heap is taking up all your memory. At that point, you repeat the process, writing a new run to a new file.
This will usually produce considerably longer intermediate runs in the initial phase, so merging them is substantially less work. Assuming the input records are in random order, you can expect this to approximately double the length of each run--but if the input is even partially sorted, this can take advantage of that existing ordering to extend the run lengths even more.
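Here is a minimal Python sketch of that run-generation scheme (Knuth calls it replacement selection); the yielded lists stand in for the run files you would actually write to disk, and the records are assumed to be directly comparable:

    import heapq

    def replacement_selection(records, memory_size):
        # Split a stream of records into sorted runs, each usually longer
        # than memory_size. Yielded lists stand in for run files on disk.
        it = iter(records)

        # Fill the working heap with as many records as "fit in memory".
        current = [rec for _, rec in zip(range(memory_size), it)]
        heapq.heapify(current)
        next_run = []                          # second heap: held for the next run

        while current:
            run = []
            while current:
                smallest = heapq.heappop(current)
                run.append(smallest)           # "write the record out"
                try:
                    rec = next(it)             # read another record in
                except StopIteration:
                    continue
                if rec >= smallest:
                    heapq.heappush(current, rec)    # still fits in this run
                else:
                    heapq.heappush(next_run, rec)   # would break the ordering
            yield run
            current, next_run = next_run, []   # start the next run

    # Example: 10 "records" with room for only 3 in memory.
    for run in replacement_selection([5, 1, 9, 2, 8, 3, 7, 4, 6, 0], 3):
        print(run)

On this toy input the first run comes out longer than the three-record "memory", which is the whole point: fewer, longer runs mean fewer files to merge later.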
As an aside, I certainly didn't invent this -- I probably first read about it in Knuth, but perhaps in Algorithms + Data Structures = Programs (Niklaus Wirth) -- both discuss it. Knuth credits first publication of the method to "H. Seward", in his master's thesis at MIT in 1954. If you have the second edition of Knuth, it's on page 254 of volume 3. I don't have a copy of the third edition, so I don't have a page number for that.