Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Time Complexity/Cost of External Merge Sort

I got this from a link which talks about external merge sort.

From slide 6 Example: with 5 buffer pages, to sort 108 page file

  • Pass0: [108/5] = 22 sorted runs of 5 pages each (last run only with 3 pages)

  • Pass1 [22/4] = 6 sorted runs of 20 pages each (last run only with 8 pages)

  • Pass2: [6/3] = 2 sorted runs, 80 pages and 28 pages

  • Pass 3: [2/2] = 1 Sorted file of 108 pages

Question: My understanding is in external merge sort, in pass 0 you create chunks and then sort each chunk. In remaining passes you keep merging them. So, applying that to the above example, since we have only 5 buffer pages, in Pass 0 its clear we need 22 sorted runs of 5 pages each.

  1. Now, why are we doing sorted runs for remaining passes instead or merging ?

  2. How come it tells for pass 1, 6 sorted runs of 20 pages each when we have only 5 buffer pages ?

  3. Where exactly is merge happening here ? and how is N reducing in each pass i.e from 108 to 22 to 6 to 2 ?

like image 773
Frank Q. Avatar asked Apr 28 '12 01:04

Frank Q.


1 Answers

External Merge Sort is necessary when you cannot store all the data into memory. The best you can do is break the data into sorted runs and merge the runs in subsequent passes. The length of a run is tied to your available buffer size.

Pass0: you are doing the operations IN PLACE. So you load 5 pages of data into the buffers and then sort it in place using an in place sorting algorithm. These 5 pages will be stored together as a run.

Following passes: you can no longer do the operations in place since you're merging runs of many pages. 4 pages are loaded into the buffers and the 5th is the write buffer. The merging is identical to the merge sort algorithm, but you will be dividing and conquering by a factor of B-1 instead of 2. When the write buffer is filled, it is written to disk and the next page is started.

Complexity: When analyzing the complexity of external merge sort, the number of I/Os is what is being considered. In each pass, you must read a page and write the page. Let N be the number of pages. Each run will cost 2N. Read the page, write the page.
Let B be the number of pages you can hold buffer space and N be the number of pages.
There will be ceil(log_B-1(ceil(N/B))) passes. Each pass will have 2N I/Os. So O(nlogn).

In each pass, the page length of a run is increasing by a factor of B-1, and the number of sorted runs is decreasing by a factor of B-1.
Pass0: ceil(108 / 5) = 22, 5 pages per run
Pass1: ceil(22 / 4) = 6, 20 pages per run
Pass2: ceil(6 / 4 ) = 2, 80 pages per run
Pass3: ceil(2 / 4 ) = 1 - done, 1 run of 108 pages

like image 116
JustinDanielson Avatar answered Oct 22 '22 13:10

JustinDanielson