Get the largest values from each column of a pandas.DataFrame

Here is my pandas.DataFrame:

import pandas as pd
data = pd.DataFrame({
  'first': [40, 32, 56, 12, 89],
  'second': [13, 45, 76, 19, 45],
  'third': [98, 56, 87, 12, 67]
}, index = ['first', 'second', 'third', 'fourth', 'fifth'])

I want to create a new DataFrame that contains the top 3 values from each column of my data DataFrame.

Here is the expected output:

   first  second  third
0     89      76     98
1     56      45     87
2     40      45     67

How can I do that?

Michael asked Dec 09 '13 17:12

3 Answers

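Create a function that returns the top num values of a Series (named top_n here so that it does not shadow Python's built-in sorted):

def top_n(s, num):
    # Sort descending and keep the first num values (s.order(...) in old pandas versions)
    tmp = s.sort_values(ascending=False).head(num)
    # Replace the original labels with 0, 1, 2, ... so the columns line up by rank
    return tmp.reset_index(drop=True)

Apply it to your data set:

In [1]: data.apply(lambda x: top_n(x, 3))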
Out[1]:
   first  second  third
0     89      76     98
1     56      45     87
2     40      45     67
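
As a possible alternative (not part of the original answer), pandas also provides Series.nlargest, which returns the largest values already in descending order. A minimal sketch, assuming the data frame from the question; the helper name top_values is hypothetical:

```python
import pandas as pd

# Data frame copied from the question.
data = pd.DataFrame({
    'first': [40, 32, 56, 12, 89],
    'second': [13, 45, 76, 19, 45],
    'third': [98, 56, 87, 12, 67],
}, index=['first', 'second', 'third', 'fourth', 'fifth'])


def top_values(frame, num):
    """Hypothetical helper: the num largest values of every column, ranked 0..num-1."""
    # nlargest keeps the original row labels, so drop them to align columns by rank.
    return frame.apply(lambda col: col.nlargest(num).reset_index(drop=True))


top3 = top_values(data, 3)
print(top3)
```

Dropping the index matters here: without reset_index, each column would keep its own row labels and apply would align them by label rather than by rank.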
Zelazny7 answered Sep 22 '22 09:09


With NumPy you can get an array of the top 3 values of each column as follows:

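>>> import numpy as np
>>> # Row positions of each column's values, largest first
>>> col_ind = np.argsort(data.values, axis=0)[::-1, :]
>>> # Turn the first 3 row positions per column into flat offsets into data.values.T
>>> ind_to_take = col_ind[:3, :] + np.arange(data.shape[1]) * data.shape[0]
>>> np.take(data.values.T, ind_to_take)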
array([[89, 76, 98],
       [56, 45, 87],
       [40, 45, 67]], dtype=int64)

You can convert the result back to a DataFrame:

>>> pd.DataFrame(_, columns=data.columns, index=data.index[:3])
        first  second  third
first      89      76     98
second     56      45     87
third      40      45     67
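
The flat-index arithmetic above can also be expressed with np.take_along_axis (available since NumPy 1.15), which gathers values column by column without computing offsets by hand. A minimal sketch, assuming the data frame from the question; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

# Data frame copied from the question.
data = pd.DataFrame({
    'first': [40, 32, 56, 12, 89],
    'second': [13, 45, 76, 19, 45],
    'third': [98, 56, 87, 12, 67],
}, index=['first', 'second', 'third', 'fourth', 'fifth'])

values = data.to_numpy()

# Sorting the negated values ascending yields row positions in descending order of value.
order = np.argsort(-values, axis=0, kind='stable')

# Keep the first 3 row positions of every column and gather the matching values.
top = np.take_along_axis(values, order[:3], axis=0)

# A fresh 0..2 index avoids suggesting that the rows correspond to original row labels.
result = pd.DataFrame(top, columns=data.columns)
print(result)
```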
alko answered Sep 21 '22 09:09


The other solutions (at the time of writing) fully sort each column, which takes O(n log n) time per column, but the top k values can be found in linear time per column.

numpy.partition(a, k) rearranges the array so that the k smallest elements occupy the first k positions, in no particular order. To get the k largest elements, negate the input and the output:

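import numpy as np

# v is a 1-D array with more than k elements.
# The outer sort puts the k selected values in descending order (O(k log k) extra work).
-np.sort(np.partition(-v, k)[:k])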

Combining this with a dictionary comprehension, and sorting the k selected values so that they come out in descending order, we can use:

>>> pd.DataFrame({c: -np.sort(np.partition(-data[c].to_numpy(), 3)[:3]) for c in data.columns})
   first  second  third
0     89      76     98
1     56      45     87
2     40      45     67
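
To make the partition-based idea reusable, one possible sketch wraps it in a helper (hypothetical name top_k). It uses k - 1 as the pivot position so that it also works when k equals the column length, and it sorts only the k selected values. It assumes a signed numeric dtype, since negating unsigned integers wraps around:

```python
import numpy as np
import pandas as pd


def top_k(values, k):
    """Hypothetical helper: the k largest entries of a 1-D array-like, largest first."""
    arr = np.asarray(values)
    k = min(k, arr.size)  # never ask for more values than exist
    if k == 0:
        return arr[:0]
    # After partitioning the negated array at position k - 1, the first k
    # entries are the k smallest negated values, in no particular order.
    smallest_negated = np.partition(-arr, k - 1)[:k]
    # Sorting just those k entries and negating back gives descending order.
    return -np.sort(smallest_negated)


# Data frame copied from the question.
data = pd.DataFrame({
    'first': [40, 32, 56, 12, 89],
    'second': [13, 45, 76, 19, 45],
    'third': [98, 56, 87, 12, 67],
}, index=['first', 'second', 'third', 'fourth', 'fifth'])

result = pd.DataFrame({c: top_k(data[c].to_numpy(), 3) for c in data.columns})
print(result)

# Sanity check against a full sort on random data.
rng = np.random.default_rng(0)
sample = rng.integers(0, 1000, size=50)
matches_full_sort = np.array_equal(top_k(sample, 5), np.sort(sample)[::-1][:5])
```

The selection step is linear in the column length; only the final sort of k values is super-linear, and k is typically small.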
Ami Tavory answered Sep 22 '22 09:09