Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I sort a partitioned array efficiently?

I have K number of files. I call them X1, X2, ... ,XK.
Each of these files is a N x 1 array of doubles.
It means that I actually have a NK x 1 array, partitioned in K arrays. Lets call this large array X.

I need to sort X and I cannot load all data into the memory. What is the efficient algorithm to perform this sort and save the results in separate files?

I know (of course not sure efficient) How to do it, if I just want to sort H elements:

  1. sort X1 and save it as sX1
  2. A = sX1(1:H,1) //in Matlab
  3. sort X2 and A
  4. repeat steps 1,2 and 3 for other files

But H cannot be very large, again because of memory problems.

Update
The Sort with the limited memory question is different from this question, although it helped. If I want to use that questions answer or MikeB's answer, then this should be answered too: Should I merge the K files into one file and then use external sort algorithm. If yes, How?

Thanks.

like image 500
Ramin Avatar asked Dec 18 '12 16:12

Ramin


1 Answers

What you're attempting is called an external sort. Each partition gets sorted by itself. Then, you have to merge all the partitions to build the final sorted list. If you're only looking for the top few items you can exit the merge early.

There seem to be a few existing solutions matlab solutions for external merges. Here's a link to one over at the mathworks file exchange site: http://www.mathworks.com/matlabcentral/fileexchange/29306-external-merge-sort/content/ext_merge/merge.m

Update: the code I linked shows how it's done in matlab. Specifically, the code here: http://www.mathworks.com/matlabcentral/fileexchange/29306-external-merge-sort/content/ext_merge/extmerge.m takes a list of files that need to be merged, and eventually merges them to one file.

In your original problem statement, you said you have K files, from X1 thru XK. An external sort first sorts those files, then merges them into one file. A simple implementation would have pseudocode like this:

// external merge-sort algorithm
For each file F in (X1 ... XK)
    Read file F into memory array R
    Sort R
    Overwrite file F with sorted data from R
    Clear array R in memory
For N = K-1 down to 1
    in-order merge file XN+1 and XN into file X'
    erase file XN+1 and XN
    rename file X' as XN

You should see that the first phase is to sort. We read each file into memory, sort it, and write it back out. This is I/O, but it's efficient; hopefully, we're using as much memory as possible so that we sort in memory as much as we can. At the end of that first loop, we have K files, each one sorted within its own domain of values.

Given the K sorted files, our next step is to merge them. Merging two files doesn't use any memory, but does lots of I/O. Merging two files looks like this, given two files named L and R we can merge them into O:

// merge two files algorithm
Get value LV from L
Get value RV from R
While L is not EOF AND R is not EOF
    if ( LV <= RV )
        write LV into O
        get value LV from L
    else 
        write RV into O
        get value RV from R
While L is not EOF
    get LV from L
    write LV into O
While R is not EOF
    get RV from R
    write RV into O

The second loop in the merge-sort will merge two files, N+1 and N into a single file N. It loops through each of your files and merges them. This reads and re-writes lots of data, and you can get a bit more efficient than that by handling multiple files in a loop. But it works fine as I've written it.

like image 52
MikeB Avatar answered Oct 29 '22 11:10

MikeB