Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for N-way merge

A 2-way merge is widely studied as a part of Mergesort algorithm. But I am interested to find out the best way one can perform an N-way merge?

Lets say, I have N files which have sorted 1 million integers each. I have to merge them into 1 single file which will have those 100 million sorted integers.

Please keep in mind that use case for this problem is actually external sorting which is disk based. Therefore, in real scenarios there would be memory limitation as well. So a naive approach of merging 2 files at a time (99 times) won't work. Lets say we have only a small sliding window of memory available for each array.

I am not sure if there is already a standardized solution to this N-way merge. (Googling didn't tell me much).

But if you know if a good n-way merge algorithm, please post algo/link.

Time complexity: If we greatly increase the number of files (N) to be merged, how would that affect the time complexity of your algorithm?

Thanks for your answers.

I haven't been asked this anywhere, but I felt this could be an interesting interview question. Therefore tagged.

like image 240
bits Avatar asked Feb 20 '11 07:02

bits


People also ask

How do you write an algorithm for merge sort?

Algorithm for Merge SortStep 1: Find the middle index of the array. Step 2: Divide the array from the middle. Step 4: Call merge sort for the second half of the array. Step 5: Merge the two sorted halves into a single sorted array.

What is two-way merge sort algorithm?

Another name for an iterative 2-way merge sort is bottom up merge sort, while another name for recursive merge sort is top down merge sort. Generally, an optimized bottom up merge sort is slightly more efficient than an optimized top down merge sort.

How do you use merge algorithms?

Working of Merge sort Algorithm According to the merge sort, first divide the given array into two equal halves. Merge sort keeps dividing the list into equal parts until it cannot be further divided. As there are eight elements in the given array, so it is divided into two arrays of size 4.


2 Answers

How about the following idea:

  1. Create a priority queue

  2. Iterate through each file f
    1. enqueue the pair (nextNumberIn(f), f) using the first value as priority key

  3. While queue not empty
    1. dequeue head (m, f) of queue
    2. output m
    3. if f not depleted
      1. enqueue (nextNumberIn(f), f)

Since adding elements to a priority queue can be done in logarithmic time, item 2 is O(N × log N). Since (almost all) iterations of the while loop adds an element, the whole while-loop is O(M × log N) where M is the total number of numbers to sort.

Assuming all files have a non-empty sequence of numbers, we have M > N and thus the whole algorithm should be O(M × log N).

like image 65
aioobe Avatar answered Sep 19 '22 19:09

aioobe


Search for "Polyphase merge", check out classics - Donald Knuth & E.H.Friend.

Also, you may want to take a look at the proposed Smart Block Merging by Seyedafsari & Hasanzadeh, that, similarly to earlier suggestions, uses priority queues.

Another interesting reasonsing is In Place Merging Algorithm by Kim & Kutzner.

I also recommend this paper by Vitter: External memory algorithms and data structures: dealing with massive data.

like image 24
Grigori Melnik Avatar answered Sep 18 '22 19:09

Grigori Melnik