Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

External Sort between two files

I'm trying to get my head around an external sort for a requirement I have - and I can't.

The requirement is to externally sort a file of an arbitrary size but using just the original file and one other (call them fileA and fileB) - two files including the original. I can read/write to either of these - so can swap between the two...

I cannot figure out how to implement this - as most sorting algorithms require you to be able to have an overview of the entire array in memory to sort it, surely?

Say I have a random integer array:

[1, 5, 8, 7, 3, 4, 1, 9, 0, 1, 8, 7, 7, 3, 2, 9, 1, 2];

And at any given time, I can only read four pages (e.g. four integers) into memory.

On each pass, this gives me five separate arrays to sort:

[1, 5, 8, 7]
[3, 4, 1, 9] 
[0, 1, 8, 7] 
[7, 3, 2, 9]
[1, 2]

If I apply an in-memory sort on these, I then get:

[1, 5, 7, 8]
[1, 3, 4, 9] 
[0, 1, 7, 8] 
[2, 3, 7, 9]
[1, 2]

But if I can only fit four pages into memory at any one time, I don't see how I can further sort these without some horrible complex algorithm which loops over the entire array again and again to ensure its all sorted.

I'm thoroughly confused - because without reading the entire array into memory, we have no idea what elements are before the four pages, or after - so we cannot truly sort them?

Can somebody help me please and explain the crucial step in solving this?

like image 588
keldar Avatar asked Oct 19 '25 23:10

keldar


2 Answers

Since, the basic idea of External Sort is to merge the lists that are bigger than the available memory, therefore you control the lists (which may be actually implemented as arrays or linked lists or anything else) by means of a handle over them. In order to read an element from a list, you will use some method like listHandle.getNextElement(). To write to disk in the list, use mergedDoubleSizedList.writeNextElement().

After you have:

[1, 5, 7, 8] // controlled using handle1
[1, 3, 4, 9] // controlled using handle2
[0, 1, 7, 8] // controlled using handle3
[2, 3, 7, 9] // controlled using handle4
[1, 2] // controlled using handle5

and that you read only 4 ints, you get the handle over the first two arrays (handle1 and handle2), read them element by element simultaneously, and write them back as one consolidated array which is sorted (mergedListHandle1). Like this:

[1, 1, 3, 4, 5, 7, 8, 9] // written by creating new handle - mergedListHandle1
[0, 1, 2, 3, 7, 7, 8, 9] // written by creating - mergedListHandle2
[1, 2] // written back by creating mergedListHandle3

Now again you get the handle over two of the arrays (mergedListHandle1 and mergedListHandle2) resulting from the previous step and keep merging them on and on until you are left with only two handles, resulting in one final sorted array. Please provide your code if you want the code based solution for the same.

At a time, you will have only 4 elements in the memory if that's what your memory allows. So, to merge the lists represented by handle1 and handle2, you will do the following:

  1. Read first element from handle1 and handle2 (1 and 1)
  2. Write the smaller of these two to mergedListHandle1 (i.e. write 1 of handle1)
    1. You may not flush the numbers in mergedListHandle1 at the moment.
  3. Read next element from handle1 (5)
  4. Write the smaller of the current numbers from handle1 and handle2 to mergedListHandle1
  5. Flush the contents of mergedListHandle1 when it is full
  6. Fetch next smaller handles from disks (handle3 and handle4) and repeat the same cycle with them while you write to the new bigger list handle called mergedListHandle2.
like image 120
displayName Avatar answered Oct 21 '25 13:10

displayName


Explanation for a simple 2 way bottom up merge sort. Consider the data as 18 runs of size 1. Since the size is 1, each run can be considered sorted.

[1] [5] [8] [7] [3] [4] [1] [9] [0] [1] [8] [7] [7] [3] [2] [9] [1] [2]

On each pass, even runs are merged with odd runs from left to right from a source array or file into a destination array or file. After each pass, the roles of source and destination are swapped.

First pass

[1 5] [7 8] [3 4] [1 9] [0 1] [7 8] [3 7] [2 9] [1 2]

Second pass

[1 5 7 8] [1 3 4 9] [0 1 7 8] [2 3 7 9] [1 2]

Third pass

[1 1 3 4 5 7 8 9] [0 1 2 3 7 7 8 9] [1 2]

Forth pass

[0 1 1 1 2 3 3 4 5 7 7 7 8 8 9 9] [1 2]

Fifth pass (done)

[0 1 1 1 1 2 2 3 3 4 5 7 7 7 8 8 9 9]

The main variables are run size, and 4 indices for the starts and ends of a pair of runs (even and odd). The last run ends at the end of data and could be an even or odd run. For an external sort, the indices become file pointers.

An external sort only needs 2 memory locations to hold data, one element from an even run, one element from an odd run. The two elements are compared, the lower or equal one written to the destination file, and the next element from that run is read. If the end of that run is reached, then the other element is written and the rest of the other run is copied to the destination file. The two file pointers are advanced to the start of the next pair of even and odd runs, and the merge is continued, until the end of data is reached, ending the merge pass. Run size is doubled, the roles of source and destination files are swapped, and the next merge pass is done, repeating until run size become >= number of elements.

like image 26
rcgldr Avatar answered Oct 21 '25 13:10

rcgldr



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!