Based on the answer and the MongoDB documentation, I understand that MongoDB is able to sort a large data set and return sorted results when limit() is used. However, querying the same data set with sort() alone results in a memory exception.
In the second answer to the above post, the poster mentions that the whole collection is scanned and sorted, and the top N results are returned. I would like to know how the collection is sorted when I use limit(). From the documentation I found that when limit() is used, it does a Top-K sort; however, there is not much explanation of it available anywhere. I would like to see any references on the Top-K sort algorithm.
In general, you can do an efficient top-K sort with a min-heap of size K. The min-heap represents the largest K elements seen so far in the data set. It also gives you constant-time access to the smallest element of those top K elements.
As you scan over the data set, if a given element is larger than the smallest element in the min-heap (i.e. the smallest of the largest top K so far), you replace the smallest from the min-heap with that element and re-heapify (O(lg K)).
At the end, you're left with the top K elements of the entire data set, without having had to sort them all (worst-case running time is O(N lg K)), using only Θ(K) memory.
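To make the scan concrete, here is a minimal sketch in Python using the standard-library heapq module. The function name top_k is made up for illustration, and this only shows the general min-heap technique described above; it is not a claim about how MongoDB implements its top-K sort internally.

```python
import heapq


def top_k(items, k):
    """Return the k largest items, largest first, using a min-heap of size k."""
    if k <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest of the largest k seen so far
    for item in items:
        if len(heap) < k:
            # Heap not full yet: just add the element, O(lg K).
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Larger than the current smallest of the top k:
            # pop the smallest and push the new element in one O(lg K) step.
            heapq.heapreplace(heap, item)
    # Only the k survivors are sorted at the end, O(K lg K).
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    print(top_k([5, 1, 9, 3, 7, 8, 2], 3))
```

The heap never holds more than K elements, which is where the Θ(K) memory bound comes from. In real Python code you would probably just call heapq.nlargest(k, items), which serves the same purpose.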
I actually learnt this in school for a change :-)