Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# N way merge for external sort

Tags:

c#

merge

sorting

Whats the best way to implement an N way merge for N sorted files?

Lets say I have 9 sorted files with 10 records each? How do I merge these files to create a big file with 90 sorted records?

like image 479
user262102 Avatar asked Feb 18 '10 17:02

user262102


2 Answers

I'm assuming that there could be a lot more data then you gave in your example. If you can open all the files simultaneously you can use this algorithm:

  • Read the first line from each file, so you have 10 lines in memory, one from each file.
  • Put the lines into a priority queue by sort order.
  • Take the least element (sorted first) out of the priority queue and write to the output file.
  • Read one more line from the corresponding file the line came from and put that into the priority queue.
  • Repeat until all files are read to the end.

Note that you don't have to read all the files into memory at once, so this will work well if you have a reasonable number of large files, but not if you have a lot of small files.

If you have a lot of small files, you should merge them in groups to make a single output file for each group, then repeat the process to merge these new groups.

In C# you can use for example a SortedDictionary to implement the priority queue.

like image 86
Mark Byers Avatar answered Nov 10 '22 20:11

Mark Byers


Addressing the comments in the other answer:

If you have an variable number of files, here's what I'd do. This is just a sketch to get the idea across; this code doesn't compile, I've gotten the method names wrong, and so on.

// initialize the data structures
var priorityQueue = new SortedDictionary<Record, Stream>();
var streams = new List<Stream>();
var outStream = null; 
try
{
  // open the streams.
  outStream = OpenOutputStream();
  foreach(var filename in filenames)
    streams.Add(GetFileStream(filename));
  // initialize the priority queue
  foreach(var stream in streams)
  {
    var record = ReadRecord(stream);
    if (record != null)
      priorityQueue.Add(record, stream);
  // the main loop
  while(!priorityQueue.IsEmpty)
  {
     var record = priorityQueue.Smallest;
     var smallestStream = priorityQueue[record];
     WriteRecord(record, outStream);
     priorityQueue.Remove(record);
     var newRecord = ReadRecord(smallestStream);
     if (newRecord != null)
       priorityQueue.Add(newRecord, smallestStream);
  }
}
finally { clean up the streams }

Does that make sense? You just keep on grabbing the smallest thing out of the priority queue and replacing it with the next record in that stream, if there is one. Eventually the queue will be empty and you'll be done.

like image 42
Eric Lippert Avatar answered Nov 10 '22 20:11

Eric Lippert