Suppose I have X GB of RAM space available and I need to sort a huge array of data (much greater the all the available memory. It's stored on the hard drive). Could you give a hint, how that could be accomplished?
Solution 1: step 1: Read M from the file and write it into a temp buffer. step 2: Sort (you should use in-place sorting algorithm, like QuickSort,HeapSort) the values. step 3: Create a temp file and write the buffer into the temp file.
In place sorting algorithms are the most memory efficient, since they require practically no additional memory. Linked list representations require an additional N words of memory for a list of pointers.
Suppose we have to sort a 1GB file of random integers and the available ram size is 200 Mb, how will it be done? The easiest way to do this is to use external sorting. We divide our source file into temporary files of size equal to the size of the RAM and first sort these files.
Memory object sorting uses a memory object in 64-bit virtual storage to improve the performance of sort applications that use DFSORT's Blockset Technique. A memory object is a data area in virtual storage that is allocated above the bar and backed by central storage.
You are looking for external sorting. The largest cost in these scenarios is often disk IO. So the trick is to use an algorithm that minimises disk IO. The usual approach is to read suitably large chunks in to memory, sort those chunks, saving them back to disk, then merge the sorted chunks.
A search for "external sorting" or "sort merge" and your technologies of choice should give some good results.
Let us assume you have a huge file H and limit memory M.
I have two solutions.
Solution 1:
step 1: Read M from the file and write it into a temp buffer.
step 2: Sort (you should use in-place sorting algorithm, like QuickSort,HeapSort) the values.
step 3: Create a temp file and write the buffer into the temp file. Save the name of the temp file.
step 4: Repeat step 1 to 3 until read file H out, and sort M out and save all temp files.
step 5: Merge all temp files to a new file. Create a new file,and open all the temp files,put the file handles into a array.
step 6: Find the minimum number from the set of numbers currently pointed to by the file read pointer. You can use normal way to do that, compare every number, and use a temp variable to remember it (time complexity is O(n). Or you can create a priorityQueue,and a map, the key of the map is the number, and the value of the map is the sequence of the file handles.(time complexity is O(lgn),the second way wastes more memory, but performance is better,if you want a better way, you can use a int to replace list using bitmap, becase the temp file names are sequeced.
step 7: Save the number to the new file.
step 8: Read another number from the file that contained the minimum number at step 6.
step 9: Repeat step 7 to 8 until all numbers from all the temp files have been processed.
Solution 2:
step 1 : Create N temp files, the range of the numbers of every temp files are different.(For example,the range of temp_file_1 is from 0 to 1000000, and the next temp file is from 1000001 to 2000000...)
step 2: Read m from the H file and write the number into the different temp files until nothing can read from file H.
step 3: Sort every temp files.
step 4: Create a new file, merge all temp files into this new file one by one.
What's the difference between the solutions. According to solution 1, every number is sorted once and is compared many times(step 5), but read and write only twice. about solution 2, no merge processing, but every single number is read and wrote three times.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With