Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Better sort algorithm for pre-sorted chunks

I have a very large list of elements (few millions) which are locally sorted in chunks of variable size.

a b c b i h a j k l e f g a b l...

I know the size, starting and ending position of every chunk in advance.

[a b c] [b i h m] [a j k l] [e f g] [a] [b l] ...

Is there a faster way to sort the list using an algorithm that can make use of the boundary information?

like image 877
Rohit Avatar asked Aug 11 '17 09:08

Rohit


People also ask

Which sort is best for already sorted data?

Insertion sort runs much more efficiently if the array is already sorted or "close to sorted." Selection sort always performs O(n) swaps, while insertion sort performs O(n2) swaps in the average and worst case.

Which sorting algorithm is best suited if the elements are already sorted?

Merge Sort Was this answer helpful?

Which is the fastest sorting algorithm when the list is nearly sorted?

Insertion sort is the clear winner on this initial condition. Bubble sort is fast, but insertion sort has lower overhead. Shell sort is fast because it is based on insertion sort. Merge sort, heap sort, and quick sort do not adapt to nearly sorted data.

Which sorting algorithm is best suited for insanely huge data?

In this case, bubblesort was the right answer even though it has a higher big-O complexity then any number of other algorithms.


Video Answer


2 Answers

The keyword you are looking for is k-way/multiway merging:
You have k separate sorted lists and want to merge them into a single list of size n.

There are two basic approaches that work have similar asymptotic runtime characteristics but work differently in practice:

Iterative 2-way merging

(conceptually) builds a balanced binary tree with the lists as leaves and iteratively merges them together until you have only a single list left. This gives you log k merge operations (height of the tree) where every merge operation takes time n.
This is basically what rakwaht and schnaader described.

Direct k-way merging

uses a (binary) heap or a tournament tree to store the smallest element that has not yet been merged for every chunk. Removing the minimum element from this data structure causes the next element from the corresponding chunk to be inserted. So a step of the algorithm takes O(log k) and is repeated n times, thus giving the same runtime as the iterative binary merge.

Note that the tournament tree approach is more efficient in practice since the traversal of the tournament tree is less data-dependent than the binary heap.

You can also always think about a solution in between these two approaches, like doing 16-way merges, which might be more efficient than one of the two 'extreme' approaches above.

The (iterated) multiway merge approach might seem more complex, however for applications that process huge amounts of data (external memory operations which store most of the data on a hard disk), this is much more efficient since much fewer merging steps are necessary.

like image 55
Tobias Ribizel Avatar answered Sep 25 '22 23:09

Tobias Ribizel


This is similar to the second phase of a merge sort where pairs of sorted sublists are merged together. I would try the following approach:

  1. Divide the list into chunks
  2. Merge two chunks together with the pseudocode below
  3. Repeat step 2 until only one chunk is left, this one is your sorted list

Pseudocode for the merge part:

merge(chunk1[], chunk2[], chunk1_length, chunk2_length) {
  chunk1_pointer = chunk2_pointer = 0
  repeat the following:
    compare chunk1[chunk1_pointer] and chunk2[chunk2_pointer]
      same value: add both to the output, increase both pointers
      chunk1 value larger: add chunk2 value to the output, increase chunk2_pointer
      chunk2 value larger: add chunk1 value to the output, increase chunk1_pointer
    is one of the pointers at the end of the chunk?
      add the remaining elements of the other chunk to the output and exit
}
like image 21
schnaader Avatar answered Sep 21 '22 23:09

schnaader