Efficient Out-Of-Core Sorting

I'm trying to work out how to efficiently sort a huge dataset that won't fit in memory. The obvious answer at a high level is to sort a whole bunch of chunks that do fit in memory using some standard algorithm, write these out to disk, and then merge them. Merging them is the problem.

Let's say the data divides up into C chunks, so I have C files to merge. If I do a C-way merge in one pass, comparing the heads of all C chunks for each output record, then technically I have an O(N^2) algorithm, though one that only has to perform O(N) writes to disk. If I iteratively merge them into C/2 files, then C/4 files, etc., then I have an O(N log N) algorithm, but one that has to perform O(N log N) writes to disk, and therefore has a huge constant term.

What is the typical solution to this conundrum? Is there any good one?
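
For concreteness, here is roughly what I mean by the single-pass C-way merge, as a Python sketch (the file names and the one-value-per-line format are just placeholders): pick the smallest of the C chunk heads by a linear scan for every output record.

    # Rough sketch of the straightforward one-pass merge described above: for
    # each output record, scan the current head of every chunk file and write
    # the smallest, which is O(C) work per record, hence roughly O(N*C) total.
    def merge_runs(run_paths, out_path):
        files = [open(p) for p in run_paths]
        heads = []
        for f in files:
            line = f.readline()
            heads.append(int(line) if line else None)  # None marks an exhausted chunk
        with open(out_path, "w") as out:
            while any(h is not None for h in heads):
                # linear scan over the C chunk heads for the smallest value
                i = min((i for i, h in enumerate(heads) if h is not None),
                        key=lambda i: heads[i])
                out.write(f"{heads[i]}\n")
                line = files[i].readline()
                heads[i] = int(line) if line else None
        for f in files:
            f.close()

    # e.g. merge_runs(["run0.txt", "run1.txt", "run2.txt"], "sorted.txt")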

asked Oct 29 '09 by dsimcha



2 Answers

It's funny, as I heard this same question not a month ago... along with the response that our local guru gave.

"Use the unix sort command"

Though we admittedly thought it was a joke at the expense of the asker... it turns out that it was not. The reasoning is that those smart guys have already given a lot of thought to how to sort very large files, and came up with a very impressive implementation that makes good use of the available resources.

Therefore, unless you plan on re-inventing the wheel (i.e. you have the time and this is business critical), simply using the unix sort command is probably an excellent idea.

The only drawback is its arcane syntax. This page is dedicated to the command, with various explanations.

My personal advice: take a small sample of the data and test that the command does exactly what you want.
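
For illustration only, here is roughly how you might drive it on a file that doesn't fit in RAM (shown via Python's subprocess; the buffer size, scratch directory, and file names are placeholders, and --parallel is a GNU extension):

    import subprocess

    # Rough illustration of invoking GNU sort on a huge file; the paths and
    # sizes below are placeholders, not recommendations.
    subprocess.run(
        [
            "sort",
            "-S", "4G",          # in-memory buffer used for each sorting pass
            "-T", "/scratch",    # directory that receives the temporary run files
            "--parallel=4",      # GNU extension: sort on several cores
            "-o", "sorted.txt",  # output file
            "input.txt",
        ],
        check=True,
    )

Add -k, -t and -n as needed to pick the sort key; those are standard, while -S and --parallel may not exist in every sort implementation.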

answered by Matthieu M.


The simple answer is that there is no simple answer to this question. There are lots of answers, most of them fairly complex -- Knuth volume 3 (for one example) devotes a great deal of space to it.

One thing that becomes obvious when looking through what's been done is that you really want to minimize the number of runs you create during your initial sorting, and maximize the length of each. To do that, you generally want to read in about as much data as you can fit in memory, but instead of just sorting it and writing it out, you want to put it into a heap. Then as you write each record out, you read IN another record.

You then check whether that record would sort before or after the record you just wrote out. If it would sort after it, you insert it into your heap and continue. If it would sort before, you insert it into a second heap.

You stop adding records to the current run when the first heap is completely empty, and your second heap is taking up all your memory. At that point, you repeat the process, writing a new run to a new file.

This will usually produce considerably longer intermediate runs in the initial phase, so merging them is substantially less work. Assuming the input records are in random order, you can expect this to approximately double the length of each run--but if the input is even partially sorted, this can take advantage of that existing ordering to extend the run lengths even more.
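
Here is a rough Python sketch of that two-heap scheme (usually called replacement selection); the memory_size limit, the write_run callback, and treating records as plain comparable values are simplifications for illustration:

    import heapq

    # Rough sketch of replacement selection.  `records` is any iterable of
    # comparable values, `memory_size` is how many records fit in memory, and
    # `write_run` is a placeholder callback that writes one sorted run to disk.
    def generate_runs(records, memory_size, write_run):
        it = iter(records)
        current = [x for _, x in zip(range(memory_size), it)]  # fill memory
        heapq.heapify(current)
        pending = []   # records that have to wait for the next run
        run = []       # the run currently being written out
        for rec in it:
            smallest = heapq.heappop(current)
            run.append(smallest)              # "write" the next record of the run
            if rec >= smallest:
                heapq.heappush(current, rec)  # still fits in the current run
            else:
                heapq.heappush(pending, rec)  # would sort before what we just wrote
            if not current:                   # first heap empty: close this run
                write_run(run)
                run = []
                current, pending = pending, []
        # Input exhausted: drain the current heap, then the waiting heap.
        while current:
            run.append(heapq.heappop(current))
        if run:
            write_run(run)
        if pending:
            write_run(sorted(pending))

Each pop is paired with a push, so memory use stays at roughly memory_size records throughout.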

As an aside, I certainly didn't invent this -- I probably first read about it in Knuth, but perhaps in Algorithms + Data Structures = Programs (Niklaus Wirth) -- both discuss it. Knuth credits first publication of the method to "H. Seward", in his master's thesis at MIT in 1954. If you have the second edition of Knuth, it's on page 254 of volume 3. I don't have a copy of the third edition, so I don't have a page number for that.

answered by Jerry Coffin