 

What is the time complexity of heapq.nlargest?

I was watching this PyCon talk (at 34:30), where the speaker says that getting the t largest elements of a list of n elements can be done in O(t + n).

How is that possible? My understanding is that creating the heap will be O(n), but what's the complexity of nlargest itself? Is it O(n + t) or O(t), and what is the actual algorithm?

foo asked Apr 13 '14


People also ask

What is time complexity Heapq?

heapq is a binary heap, with O(log n) push and O(log n) pop; see the heapq source code. Pushing all n items onto a heap takes O(n log n), and then popping down to the kth largest element takes O((n - k) log n), so that approach is O(n log n) overall.
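A minimal sketch of those per-operation costs, using only the standard heapq functions:

```python
import heapq

heap = []
for x in [5, 1, 8, 3]:
    heapq.heappush(heap, x)   # each push sifts up: O(log n)

smallest = heapq.heappop(heap)  # each pop sifts down: O(log n)
assert smallest == 1            # heap[0] is always the minimum
```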

What is Heapq Nlargest in Python?

The nlargest() function of the Python module heapq returns the specified number of largest elements from a Python iterable, such as a list or tuple. nlargest() can also be passed a key function that returns a comparison key to be used for the ordering.
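For example, with a key function (the word list here is illustrative):

```python
import heapq

words = ["fig", "banana", "kiwi", "strawberry"]
# The two longest words, ordered longest first
top2 = heapq.nlargest(2, words, key=len)
assert top2 == ["strawberry", "banana"]
```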

Is Heapq fast?

heapq is faster than sorted if you need to add elements on the fly, i.e. when additions and insertions can arrive in unspecified order. Adding a new element while preserving the heap invariant is faster than re-sorting the array after each insertion.
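A small sketch of that tradeoff: heappush sifts one element into place in O(log n) rather than re-sorting the whole list in O(n log n) after each insertion.

```python
import heapq

heap = [4, 1, 7]
heapq.heapify(heap)       # one-time O(n) setup
heapq.heappush(heap, 3)   # O(log n): sift the new element into place,
                          # no full re-sort needed
assert heapq.heappop(heap) == 1
assert heapq.heappop(heap) == 3
```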

Is Heapq always sorted?

No. The heap invariant maintained by heapq is not a sort guarantee over the underlying list; only heap[0] is guaranteed to be the smallest element.
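This is easy to observe directly:

```python
import heapq

h = [5, 1, 9, 3]
heapq.heapify(h)
# h now satisfies the heap invariant (h[0] is the minimum),
# but the list as a whole is typically not in sorted order.
assert h[0] == 1
```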


1 Answer

The speaker is wrong in this case. The actual cost is O(n * log(t)). Heapify is called only on the first t elements of the iterable. That's O(t), but is insignificant if t is much smaller than n. Then all the remaining elements are added to this "little heap" via heappushpop, one at a time. That takes O(log(t)) time per invocation of heappushpop. The length of the heap remains t throughout. At the very end, the heap is sorted, which costs O(t * log(t)), but that's also insignificant if t is much smaller than n.
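The strategy described above can be sketched as follows (a simplified version; the function name is mine, and CPython's actual implementation adds details such as key-function decoration and stability):

```python
import heapq

def nlargest_sketch(t, iterable):
    # Keep a min-heap of the t best elements seen so far.
    it = iter(iterable)
    heap = [x for _, x in zip(range(t), it)]  # first t elements
    heapq.heapify(heap)                       # O(t)
    for x in it:                              # remaining n - t elements
        # heappushpop returns x unchanged if x <= heap[0];
        # otherwise it evicts the current minimum: O(log t) each
        heapq.heappushpop(heap, x)
    return sorted(heap, reverse=True)         # final sort: O(t log t)

assert nlargest_sketch(3, [5, 1, 9, 7, 3, 8]) == [9, 8, 7]
```

Since the heap never grows past t elements, the dominant term is the n - t calls costing O(log t) each, giving O(n * log(t)) overall.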

Fun with Theory ;-)

There are reasonably easy ways to find the t'th-largest element in expected O(n) time; for example, see here. There are harder ways to do it in worst-case O(n) time. Then, in another pass over the input, you could output the t elements >= the t-th largest (with tedious complications in case of duplicates). So the whole job can be done in O(n) time.
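For illustration, here is a sketch of the expected-O(n) approach (randomized quickselect, not the harder worst-case-O(n) median-of-medians variant; the function name is mine):

```python
import random

def kth_largest(xs, k):
    # Expected O(n): each level partitions around a random pivot and
    # recurses into only one side, so the expected work is geometric.
    pivot = random.choice(xs)
    greater = [x for x in xs if x > pivot]
    equal = [x for x in xs if x == pivot]
    if k <= len(greater):
        return kth_largest(greater, k)
    if k <= len(greater) + len(equal):
        return pivot
    lesser = [x for x in xs if x < pivot]
    return kth_largest(lesser, k - len(greater) - len(equal))

assert kth_largest([5, 1, 9, 7, 3], 2) == 7
```

Note that, as the next paragraph points out, this builds O(n)-sized partition lists, unlike the O(t)-memory approach heapq actually uses.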

But those ways require O(n) memory too. Python doesn't use them. An advantage of what's actually implemented is that the worst-case "extra" memory burden is O(t), and that can be very significant when the input is, for example, a generator producing a great many values.

Tim Peters answered Sep 20 '22