Sort with the limited memory

Tags:

Suppose I have X GB of RAM space available and I need to sort a huge array of data (much greater the all the available memory. It's stored on the hard drive). Could you give a hint, how that could be accomplished?

696

asked Dec 05 '10 09:12

Maksim Skurydzin

2 Answers

You are looking for external sorting. The largest cost in these scenarios is often disk IO. So the trick is to use an algorithm that minimises disk IO. The usual approach is to read suitably large chunks in to memory, sort those chunks, saving them back to disk, then merge the sorted chunks.

A search for "external sorting" or "sort merge" and your technologies of choice should give some good results.

102

answered Oct 04 '22 16:10

philiphobgen

Let us assume you have a huge file H and limit memory M.
I have two solutions.

Solution 1:
step 1: Read M from the file and write it into a temp buffer.
step 2: Sort (you should use in-place sorting algorithm, like QuickSort,HeapSort) the values.
step 3: Create a temp file and write the buffer into the temp file. Save the name of the temp file.
step 4: Repeat step 1 to 3 until read file H out, and sort M out and save all temp files.
step 5: Merge all temp files to a new file. Create a new file,and open all the temp files,put the file handles into a array.
step 6: Find the minimum number from the set of numbers currently pointed to by the file read pointer. You can use normal way to do that, compare every number, and use a temp variable to remember it (time complexity is O(n). Or you can create a priorityQueue,and a map, the key of the map is the number, and the value of the map is the sequence of the file handles.(time complexity is O(lgn),the second way wastes more memory, but performance is better,if you want a better way, you can use a int to replace list using bitmap, becase the temp file names are sequeced.
step 7: Save the number to the new file.
step 8: Read another number from the file that contained the minimum number at step 6.
step 9: Repeat step 7 to 8 until all numbers from all the temp files have been processed.

Solution 2: step 1 : Create N temp files, the range of the numbers of every temp files are different.(For example,the range of temp_file_1 is from 0 to 1000000, and the next temp file is from 1000001 to 2000000...)
step 2: Read m from the H file and write the number into the different temp files until nothing can read from file H.
step 3: Sort every temp files.
step 4: Create a new file, merge all temp files into this new file one by one.

What's the difference between the solutions. According to solution 1, every number is sorted once and is compared many times(step 5), but read and write only twice. about solution 2, no merge processing, but every single number is read and wrote three times.

answered Oct 04 '22 16:10

Jiaji Li

Related questions
                            
                                enable_shared_from_this - empty internal weak pointer?
                            
                                Defining lists as global variables in Python
                            
                                What is the performance difference between "insert ignore" and replace in MySQL?
                            
                                Remote debugging in visual studio: remote debugger does not support this edition of windows
                            
                                Django models: Only permit one entry in a model?
                            
                                Do any boost::asio async calls automatically time out?
                            
                                Instantiate all classes implementing a specific interface
                            
                                How to use a table type in a SELECT FROM statement?
                            
                                Xcode 4: deleting items in Build Settings
                            
                                How to add a StackPanel in a Button in C# code behind
                            
                                Parse XML file into Python object
                            
                                Is it wise to access read-only data from multiple threads simultaneously?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With