
How does the Top-K sort algorithm work in MongoDB?

Based on the answer and the MongoDB documentation, I understand that MongoDB can sort a large data set and return sorted results when limit() is used. However, querying the same data set with sort() alone results in a memory exception.

In the second answer to the above post, the poster mentions that the whole collection is scanned and sorted, and the top N results are returned. I would like to know how the collection is sorted when I use limit(). From the documentation I found that when limit() is used it does a Top-K sort, but there is not much explanation of it available anywhere. I would like to see any references on the Top-K sort algorithm.

Srinivas Mandava asked Mar 13 '17 15:03


1 Answer

In general, you can do an efficient top-K sort with a min-heap of size K. The min-heap represents the largest K elements seen so far in the data set. It also gives you constant-time access to the smallest element of those top K elements.

As you scan over the data set, if a given element is larger than the smallest element in the min-heap (i.e. the smallest of the top K so far), you replace the heap's smallest element with that element and re-heapify, which costs O(lg K).

At the end, you're left with the top K elements of the entire data set, without having had to sort them all (worst-case running time is O(N lg K)), using only Θ(K) memory.
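To make the steps above concrete, here is a minimal sketch in Python using the standard heapq module. It illustrates the general min-heap technique only and is not MongoDB's actual implementation; the function name top_k is made up for this example.

```python
import heapq


def top_k(items, k):
    """Return the k largest items, largest first (hypothetical helper)."""
    if k <= 0:
        return []
    heap = []  # min-heap holding the largest k items seen so far
    for item in items:
        if len(heap) < k:
            # Heap not full yet: just add the item, O(lg K).
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # heap[0] is the smallest of the current top k.
            # Swap it out for the larger item and re-heapify, O(lg K).
            heapq.heapreplace(heap, item)
    # Only the k survivors are sorted at the end, O(K lg K).
    return sorted(heap, reverse=True)
```

For example, top_k([5, 1, 9, 3, 7, 2, 8], 3) returns [9, 8, 7], and the heap never holds more than three elements at a time. Python's standard library also offers heapq.nlargest for the same purpose.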

I actually learnt this in school for a change :-)

Cameron answered Nov 24 '22 03:11