I'm trying to get my head around an external sort for a requirement I have - and I can't.
The requirement is to externally sort a file of an arbitrary size but using just the original file and one other (call them fileA
and fileB
) - two files including the original. I can read/write to either of these - so can swap between the two...
I cannot figure out how to implement this - as most sorting algorithms require you to be able to have an overview of the entire array in memory to sort it, surely?
Say I have a random integer array:
[1, 5, 8, 7, 3, 4, 1, 9, 0, 1, 8, 7, 7, 3, 2, 9, 1, 2];
And at any given time, I can only read four pages (e.g. four integers) into memory.
On each pass, this gives me five separate arrays to sort:
[1, 5, 8, 7]
[3, 4, 1, 9]
[0, 1, 8, 7]
[7, 3, 2, 9]
[1, 2]
If I apply an in-memory sort on these, I then get:
[1, 5, 7, 8]
[1, 3, 4, 9]
[0, 1, 7, 8]
[2, 3, 7, 9]
[1, 2]
But if I can only fit four pages into memory at any one time, I don't see how I can further sort these without some horrible complex algorithm which loops over the entire array again and again to ensure its all sorted.
I'm thoroughly confused - because without reading the entire array into memory, we have no idea what elements are before the four pages, or after - so we cannot truly sort them?
Can somebody help me please and explain the crucial step in solving this?
Since, the basic idea of External Sort is to merge the lists that are bigger than the available memory, therefore you control the lists (which may be actually implemented as arrays or linked lists or anything else) by means of a handle over them. In order to read an element from a list, you will use some method like listHandle.getNextElement()
. To write to disk in the list, use mergedDoubleSizedList.writeNextElement()
.
After you have:
[1, 5, 7, 8] // controlled using handle1
[1, 3, 4, 9] // controlled using handle2
[0, 1, 7, 8] // controlled using handle3
[2, 3, 7, 9] // controlled using handle4
[1, 2] // controlled using handle5
and that you read only 4 ints, you get the handle over the first two arrays (handle1 and handle2), read them element by element simultaneously, and write them back as one consolidated array which is sorted (mergedListHandle1). Like this:
[1, 1, 3, 4, 5, 7, 8, 9] // written by creating new handle - mergedListHandle1
[0, 1, 2, 3, 7, 7, 8, 9] // written by creating - mergedListHandle2
[1, 2] // written back by creating mergedListHandle3
Now again you get the handle over two of the arrays (mergedListHandle1 and mergedListHandle2) resulting from the previous step and keep merging them on and on until you are left with only two handles, resulting in one final sorted array. Please provide your code if you want the code based solution for the same.
At a time, you will have only 4 elements in the memory if that's what your memory allows. So, to merge the lists represented by handle1 and handle2, you will do the following:
1
and 1
)1
of handle1)
5
)Explanation for a simple 2 way bottom up merge sort. Consider the data as 18 runs of size 1. Since the size is 1, each run can be considered sorted.
[1] [5] [8] [7] [3] [4] [1] [9] [0] [1] [8] [7] [7] [3] [2] [9] [1] [2]
On each pass, even runs are merged with odd runs from left to right from a source array or file into a destination array or file. After each pass, the roles of source and destination are swapped.
First pass
[1 5] [7 8] [3 4] [1 9] [0 1] [7 8] [3 7] [2 9] [1 2]
Second pass
[1 5 7 8] [1 3 4 9] [0 1 7 8] [2 3 7 9] [1 2]
Third pass
[1 1 3 4 5 7 8 9] [0 1 2 3 7 7 8 9] [1 2]
Forth pass
[0 1 1 1 2 3 3 4 5 7 7 7 8 8 9 9] [1 2]
Fifth pass (done)
[0 1 1 1 1 2 2 3 3 4 5 7 7 7 8 8 9 9]
The main variables are run size, and 4 indices for the starts and ends of a pair of runs (even and odd). The last run ends at the end of data and could be an even or odd run. For an external sort, the indices become file pointers.
An external sort only needs 2 memory locations to hold data, one element from an even run, one element from an odd run. The two elements are compared, the lower or equal one written to the destination file, and the next element from that run is read. If the end of that run is reached, then the other element is written and the rest of the other run is copied to the destination file. The two file pointers are advanced to the start of the next pair of even and odd runs, and the merge is continued, until the end of data is reached, ending the merge pass. Run size is doubled, the roles of source and destination files are swapped, and the next merge pass is done, repeating until run size become >= number of elements.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With