Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

fast way to get index of top-k elements of every column in a pandas dataframe

I have a very large pandas dataframe with approximately 500,000 columns. Each column is about 500 elements long. For each column, I need to retrieve the (index, column) location of the top-k elements in the column.

So, if k were equal to 2, and this were my data frame:

  A  B  C  D
w 4  8  10 2
x 5  1  1  6 
y 9  22 25 7 
z 15 5  7  2

I would want to return:

[(A,y),(A,z),(B,w),(B,y),(C,w),(C,y),(D,x),(D,y)]

Keep in mind that I have approximately 500,000 columns, so speed is my primary concern. Is there a reasonable way of doing this that will not take an entire week on my machine? What is the fastest way - even if it will be fast enough for the amount of data I have?

Thanks for the help!

like image 276
A. Arpi Avatar asked Aug 24 '15 18:08

A. Arpi


1 Answers

I think numpy has a good solution for this that's fast and you can format the output however you want.

In [2]: df = pd.DataFrame(data=np.random.randint(0, 1000, (200, 500000)), 
                      columns=range(500000), index=range(200))

In [3]: def top_k(x,k):
             ind=np.argpartition(x,-1*k)[-1*k:]
             return ind[np.argsort(x[ind])]

In [69]: %time np.apply_along_axis(lambda x: top_k(x,2),0,df.as_matrix())
CPU times: user 5.91 s, sys: 40.7 ms, total: 5.95 s
Wall time: 6 s

Out[69]:
array([[ 14,  54],
       [178, 141],
       [ 49, 111],
       ...,
       [ 24, 122],
       [ 55,  89],
       [  9, 175]])

Pretty fast compared to the pandas solution (which is cleaner IMO but we're going for speed here):

In [41]: %time np.array([df[c].nlargest(2).index.values for c in df])
CPU times: user 3min 43s, sys: 6.58 s, total: 3min 49s
Wall time: 4min 8s

Out[41]:
array([[ 54,  14],
       [141, 178],
       [111,  49],
       ...,
       [122,  24],
       [ 89,  55],
       [175,   9]])

The lists are in reverse order of each other (you can easily fix this by reversing sort in the numpy version)

Note that in the example due to random int generation we can likely have more than k values that are equal and max so indices returned may not agree among all methods but all will yield a valid result (you will get k indices that match the max values in the column)

like image 147
cwharland Avatar answered Oct 25 '22 13:10

cwharland