Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I get n largest lists from a list of lists in python

Tags:

python

I am using heapq to get nlargest elements from list of lists. The program I wrote is below.

import csv
import heapq
f = open("E:/output.csv","r")
read = csv.reader(f)

allrows = [row for row in read]

for i in xrange(0,2):
     print allrows[i]

allrows.sort(key=lambda x: x[2]) #this is working properly
it=heapq.nlargest(20,enumerate(allrows),key=lambda x:x[2]) #error

I just want the top 20 elements. So Instead of sorting I thought of using a heap. The error I am getting is,

  Traceback (most recent call last):
File "D:\eclipse_progs\DaDv\IMDB\Assignment1.py", line 42, in <module>
  it=heapq.nlargest(2,enumerate(allrows),key=lambda x:x[2])
File "C:\Python27\lib\heapq.py", line 470, in nlargest
  result = _nlargest(n, it)
File "D:\eclipse_progs\DaDv\IMDB\Assignment1.py", line 42, in <lambda>
  it=heapq.nlargest(2,enumerate(allrows),key=lambda x:x[2])
IndexError: tuple index out of range

Can I know why I am getting the error and how to resolve it. Is there any property of using heapq I am missing.

like image 388
WannaBeCoder Avatar asked Nov 28 '14 07:11

WannaBeCoder


2 Answers

enumerate() returns an iterable over 2-tuples. Thus accessing x[2] in your second example is always going to be out of range (the only valid indices are 0 and 1).

To make the second example equivalent to the first, you should be passing allrows directly instead of using enumerate():

it = heapq.nlargest(20, allrows, key=lambda x:x[2])

If you need to preserve the original indices, enumerate() is the way to go. However, you also need an additional level of indirection in the key function:

it = heapq.nlargest(20, enumerate(allrows), key=lambda x:x[1][2]) 
                        ^^^^^^^^^                         ^^^
like image 159
NPE Avatar answered Sep 28 '22 16:09

NPE


Thanks NPE for lighting the problem , As an alternative answer you can concatenate all of your rows with itertools.chain() and get top 20 element with sorting , that have more performance than heapq :

from itertools import chain

sorted(chain(*allrows))[-20:]

The nlargest() and nsmallest() functions are most appropriate if you are trying to find a relatively small number of items. If you are simply trying to find the single smallest or largest item (N=1), it is faster to use min() and max() . Similarly, if N is about the same size as the collection itself, it is usually faster to sort it first and take a slice (i.e., use sorted(items)[:N] or sorted(items)[-N:] ).

like image 29
Mazdak Avatar answered Sep 28 '22 17:09

Mazdak