I'm trying to determine the fastest runtime for getting k (key,value) pairs based on the smallest k keys in a dictionary. i.e.: for
mynahs = {40:(1,3),5:(5,6),11:(9,2),2:(6,3),300:(4,4),15:(2,8)}
smallestK(mynahs,3)
would return:
[(2,(6,3)),(5,(5,6)),(11,(9,2))]
I've seen a few different ways to do this:
1.
mylist = list(mynahs.keys())
mylist.sort
mylist = mylist[:k]
return [(k, mynahs[k]) for k in mylist]
but everyone seems to think heapq is the fastest
cheap = heapq.nsmallest(3, mynahs)
return [(k, mynahs[k]) for k in cheap]
How does heapq.nsmallest work and why is it fastest? I have seen this question and this one I still don't understand. Is heapq using a minheap to get the nsmallest? How does that work? I've also heard about an algorithm called quickselect, is that what it's using?
What's the runtime of it? If the dictionary is constantly changing/updating, is calling heapq.nsmallest each time you need the nsmallest the fastest way to do that?
The code for heapq.py is available at https://svn.python.org/projects/python/trunk/Lib/heapq.py
nsmallest uses one of two algorithms. If the number of items to be returned is more than 10% of the total number of items in the heap, then it makes a copy of the list, sorts it, and returns the first k items.
If k is smaller than n/10, then it uses a heap selection algorithm:
Make a copy of the first k items, and sort it
for each remaining item in the original heap
if the item is smaller than the largest item in the new list
replace the largest item with the new item
re-sort the new list
That whoever wrote this used such an inefficient algorithm is somewhat surprising. In theory at least, Quick select, which is an O(n) algorithm, should be faster than sorting, and much faster than the "optimized" algorithm for selecting n/10 items.
I'm not a Python guy, so I can't say for sure, but my experience with other languages indicates that the above should be true for Python as well.
The implementation at https://github.com/python/cpython/blob/master/Lib/heapq.py#L395 works somewhat differently.
If the k is greater than or equal to the number of items in the list, then a sorted list containing all of the elements is returned. Otherwise, it uses a standard heap selection algorithm:
create a max heap from the first k items
for each remaining item
if the item is smaller than the largest item on the heap
remove the largest item from the heap
add the new item to the heap
sort the resulting heap and return
The remove/add is combined into a single function called heap_replace.
There's an optimization in there to use the standard comparator if the key is None, but it uses the same basic heap selection algorithm.
This implementation is much more efficient than the other one that I described, although I expect it to be slower than Quickselect in the general case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With