Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a way to sort a list in python until the first sorted k elements are found?

I have a normal boring list of non sorted numbers. From that list I need to take the first k elements after sorting. The thing is that if the list is considerably long and k is considerably small sorting the entire list seems like a waste. I came up with an algorithmic solution for this, but requires me to write my own implementation for sorting, my question is: is there a way to get the same efficiency using something already implemented in python?

UPDATE:

Just to clarify, I know this will give the answer I need: sorted(boring_list)[:n]

But my concern is efficiency: I don't need to sort the whole list for this.

like image 858
Sofia Bravo Avatar asked Mar 03 '14 22:03

Sofia Bravo


2 Answers

You can use the heapq module, in particular its nlargest or nsmallest functions.

Alternatively just build the heap and call heappop(). This should take O(n) time to build the heap and O(k*log(n)) to retrieve the k elements.


Here's a very simple and small benchmark:

In [1]: import random, heapq

In [2]: seq = [random.randint(-5000, 5000) for _ in range(35000)]

In [3]: %timeit sorted(seq)[:75]
100 loops, best of 3: 14.5 ms per loop

In [4]: %%timeit
   ...: s = seq[:]
   ...: heapq.nsmallest(75, s)
   ...: 
100 loops, best of 3: 4.05 ms per loop

In [5]: %%timeit
   ...: s = seq[:]
   ...: heapq.heapify(s)
   ...: for _ in range(75): heapq.heappop(s)
   ...: 
100 loops, best of 3: 2.41 ms per loop

I have no idea why nsmallest is so much slower then calling heappop directly. In fact I should have timed it without copying seq but still:

In [6]: %%timeit
   ...: heapq.nsmallest(75, seq)
   ...: 
100 loops, best of 3: 3.82 ms per loop

Increasing the length by 100 times:

In [12]: %timeit sorted(seq)[:75]
1 loops, best of 3: 1.9 s per loop

In [13]: %%timeit
    ...: heapq.nsmallest(75, seq)
    ...: 
1 loops, best of 3: 352 ms per loop

In [14]: %%timeit
    ...: s = seq[:]
    ...: heapq.heapify(s)
    ...: for _ in range(75): heapq.heappop(s)
    ...: 
1 loops, best of 3: 264 ms per loop

Note: to counter F.J biased profiling:

In [13]: a = list(range(1000000))

In [14]: random.shuffle(a)

In [15]: %timeit sorted(a)
1 loops, best of 3: 985 ms per loop

In [16]: %%timeit
    ...: s = a[:]
    ...: heapq.heapify(s)
    ...: 
1 loops, best of 3: 284 ms per loop

As you can see heapify is quite faster then sorting even on 1000000 elements lists.

like image 51
Bakuriu Avatar answered Sep 30 '22 07:09

Bakuriu


Use heapq.nsmallest.

Maintaining the heap invariant is O(logk) where k is the size of the heap; you have to perform n push operations, making the overall complexity O(n logk). Compare this to sorting-and-taking-the-first-k-elements, which is overall complexity O(n logn). When k is small compared to n, the heapq approach clearly wins.

When k approaches n, you should instead just sort and take the first k - timsort is really good :-)

like image 21
roippi Avatar answered Sep 30 '22 06:09

roippi