 

What is the time complexity of heapq.nlargest?

I was watching this PyCon talk (at 34:30), where the speaker says that getting the t largest elements of a list of n elements can be done in O(t + n).

How is that possible? My understanding is that creating the heap will be O(n), but what's the complexity of nlargest itself? Is it O(n + t) or O(t), and what is the actual algorithm?

foo asked Apr 13 '14


People also ask

What is time complexity Heapq?

heapq is a binary heap, with O(log n) push and O(log n) pop; see the heapq source code. Pushing all n items onto a heap takes O(n log n), and then popping down to the kth largest element takes O((n - k) log n), so that approach is O(n log n) overall.
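A minimal sketch of those per-operation costs, using only the standard heapq functions:

```python
import heapq

heap = []
for x in [5, 1, 8, 3]:
    heapq.heappush(heap, x)   # each push sifts up: O(log n)

smallest = heapq.heappop(heap)  # each pop sifts down: O(log n)
assert smallest == 1            # heap[0] is always the minimum
```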

What is Heapq Nlargest in Python?

The nlargest() function of the Python module heapq returns the specified number of largest elements from a Python iterable, such as a list or tuple. nlargest() can also be passed a key function that returns a comparison key to be used for the ordering.
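For example, with a key function (the word list here is illustrative):

```python
import heapq

words = ["fig", "banana", "kiwi", "strawberry"]
# The two longest words, ordered longest first
top2 = heapq.nlargest(2, words, key=len)
assert top2 == ["strawberry", "banana"]
```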

Is Heapq fast?

heapq is faster than sorted if you need to add elements on the fly, i.e. when additions and insertions can arrive in unspecified order. Adding a new element while preserving the heap invariant is faster than re-sorting the array after each insertion.
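A small sketch of that tradeoff: heappush sifts one element into place in O(log n) rather than re-sorting the whole list in O(n log n) after each insertion.

```python
import heapq

heap = [4, 1, 7]
heapq.heapify(heap)       # one-time O(n) setup
heapq.heappush(heap, 3)   # O(log n): sift the new element into place,
                          # no full re-sort needed
assert heapq.heappop(heap) == 1
assert heapq.heappop(heap) == 3
```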

Is Heapq always sorted?

No. The heap invariant maintained by heapq is not a sort guarantee over the underlying list; only heap[0] is guaranteed to be the smallest element.
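This is easy to observe directly:

```python
import heapq

h = [5, 1, 9, 3]
heapq.heapify(h)
# h now satisfies the heap invariant (h[0] is the minimum),
# but the list as a whole is typically not in sorted order.
assert h[0] == 1
```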


1 Answer

The speaker is wrong in this case. The actual cost is O(n * log(t)). Heapify is called only on the first t elements of the iterable. That's O(t), but is insignificant if t is much smaller than n. Then all the remaining elements are added to this "little heap" via heappushpop, one at a time. That takes O(log(t)) time per invocation of heappushpop. The length of the heap remains t throughout. At the very end, the heap is sorted, which costs O(t * log(t)), but that's also insignificant if t is much smaller than n.
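The strategy described above can be sketched as follows (a simplified version; the function name is mine, and CPython's actual implementation adds details such as key-function decoration and stability):

```python
import heapq

def nlargest_sketch(t, iterable):
    # Keep a min-heap of the t best elements seen so far.
    it = iter(iterable)
    heap = [x for _, x in zip(range(t), it)]  # first t elements
    heapq.heapify(heap)                       # O(t)
    for x in it:                              # remaining n - t elements
        # heappushpop returns x unchanged if x <= heap[0];
        # otherwise it evicts the current minimum: O(log t) each
        heapq.heappushpop(heap, x)
    return sorted(heap, reverse=True)         # final sort: O(t log t)

assert nlargest_sketch(3, [5, 1, 9, 7, 3, 8]) == [9, 8, 7]
```

Since the heap never grows past t elements, the dominant term is the n - t calls costing O(log t) each, giving O(n * log(t)) overall.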

Fun with Theory ;-)

There are reasonably easy ways to find the t'th-largest element in expected O(n) time; for example, see here. There are harder ways to do it in worst-case O(n) time. Then, in another pass over the input, you could output the t elements >= the t-th largest (with tedious complications in case of duplicates). So the whole job can be done in O(n) time.
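For illustration, here is a sketch of the expected-O(n) approach (randomized quickselect, not the harder worst-case-O(n) median-of-medians variant; the function name is mine):

```python
import random

def kth_largest(xs, k):
    # Expected O(n): each level partitions around a random pivot and
    # recurses into only one side, so the expected work is geometric.
    pivot = random.choice(xs)
    greater = [x for x in xs if x > pivot]
    equal = [x for x in xs if x == pivot]
    if k <= len(greater):
        return kth_largest(greater, k)
    if k <= len(greater) + len(equal):
        return pivot
    lesser = [x for x in xs if x < pivot]
    return kth_largest(lesser, k - len(greater) - len(equal))

assert kth_largest([5, 1, 9, 7, 3], 2) == 7
```

Note that, as the next paragraph points out, this builds O(n)-sized partition lists, unlike the O(t)-memory approach heapq actually uses.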

But those ways require O(n) memory too. Python doesn't use them. An advantage of what's actually implemented is that the worst-case "extra" memory burden is O(t), and that can be very significant when the input is, for example, a generator producing a great many values.

Tim Peters answered Sep 20 '22