I have an ordinary, unsorted list of numbers. From that list I need the first k elements as they would appear after sorting. The catch is that if the list is considerably long and k is considerably small, sorting the entire list seems like a waste. I came up with an algorithmic solution for this, but it requires writing my own sorting implementation. My question is: is there a way to get the same efficiency using something already implemented in Python?
UPDATE:
Just to clarify, I know this will give the answer I need: sorted(boring_list)[:k]
But my concern is efficiency: I shouldn't need to sort the whole list for this.
You can use the heapq module, in particular its nlargest or nsmallest functions.
Alternatively, just build the heap with heapify and call heappop() yourself. This takes O(n) time to build the heap and O(k log n) to retrieve the k elements.
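A minimal sketch of both approaches described above (the list and k here are just placeholder values):

```python
import heapq

boring_list = [9, 1, 8, 3, 7, 2, 5]  # stand-in for the long unsorted list
k = 3

# One-liner: nsmallest does the heap bookkeeping for you.
smallest = heapq.nsmallest(k, boring_list)

# Equivalent manual version: heapify in O(n), then pop k times.
heap = boring_list[:]  # copy, since heapify/heappop mutate the list in place
heapq.heapify(heap)
popped = [heapq.heappop(heap) for _ in range(k)]

print(smallest)  # [1, 2, 3]
print(popped)    # [1, 2, 3]
```

Both give the same result as sorted(boring_list)[:k] without fully sorting the list.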
Here's a very simple and small benchmark:
In [1]: import random, heapq
In [2]: seq = [random.randint(-5000, 5000) for _ in range(35000)]
In [3]: %timeit sorted(seq)[:75]
100 loops, best of 3: 14.5 ms per loop
In [4]: %%timeit
...: s = seq[:]
...: heapq.nsmallest(75, s)
...:
100 loops, best of 3: 4.05 ms per loop
In [5]: %%timeit
...: s = seq[:]
...: heapq.heapify(s)
...: for _ in range(75): heapq.heappop(s)
...:
100 loops, best of 3: 2.41 ms per loop
I have no idea why nsmallest is so much slower than calling heappop directly. In fact, I should have timed it without copying seq, but still:
In [6]: %%timeit
...: heapq.nsmallest(75, seq)
...:
100 loops, best of 3: 3.82 ms per loop
Increasing the list length 100-fold:
In [12]: %timeit sorted(seq)[:75]
1 loops, best of 3: 1.9 s per loop
In [13]: %%timeit
...: heapq.nsmallest(75, seq)
...:
1 loops, best of 3: 352 ms per loop
In [14]: %%timeit
...: s = seq[:]
...: heapq.heapify(s)
...: for _ in range(75): heapq.heappop(s)
...:
1 loops, best of 3: 264 ms per loop
Note: to counter F.J's biased profiling:
In [13]: a = list(range(1000000))
In [14]: random.shuffle(a)
In [15]: %timeit sorted(a)
1 loops, best of 3: 985 ms per loop
In [16]: %%timeit
...: s = a[:]
...: heapq.heapify(s)
...:
1 loops, best of 3: 284 ms per loop
As you can see, heapify is considerably faster than sorting, even on lists of 1,000,000 elements.
Use heapq.nsmallest.
Maintaining the heap invariant is O(log k), where k is the size of the heap; you have to perform n push operations, making the overall complexity O(n log k). Compare this to sorting and taking the first k elements, which is O(n log n) overall. When k is small compared to n, the heapq approach clearly wins.
When k approaches n, you should instead just sort and take the first k; Timsort is really good :-)
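As a quick sanity check of the claim above (the sizes here are arbitrary, chosen only for illustration), nsmallest on a small k returns exactly what sorting and slicing would, while only maintaining a k-sized heap internally:

```python
import heapq
import random

n, k = 10_000, 20
data = [random.randint(-5000, 5000) for _ in range(n)]

# nsmallest maintains a heap of only k elements while scanning the
# list once, so it runs in O(n log k) rather than O(n log n).
top_k = heapq.nsmallest(k, data)

# The result is identical to fully sorting and slicing.
assert top_k == sorted(data)[:k]
```

When k is close to n, though, the slicing version is usually the better choice in practice.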