
Find the column names which have top 3 largest values for each row

Tags:

python

For example, the data looks like this:

import pandas as pd

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}
x=pd.DataFrame(df)
Out[260]: 
    a1  a2  a3  a4  a5  a6
0   5  23   4   1   4   5
1   6  43   2   2  98  43
2   3  56   3   1  23   3
3   2   2   6   3   5   2
4   5   6   7   2   7   5

I need the result to look like:

top1 top2 top3
a2   a1   a6
a5   a2   a6
....

I've seen an answer to a previous question (see below) that recommends idxmax. But how do I handle the top n values (n > 1)?

Find the column name which has the maximum value for each row

Update:

I found the answer very useful, but my data is long, so I had to find a way around that. I ended up saving the data to a CSV file and then reading it back in chunks. Here is the code I used:

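import numpy as np
import pandas as pd

# Read the file lazily, 1000 rows at a time
data = pd.read_csv('xxx.csv', chunksize=1000)
rslt = pd.DataFrame(np.zeros((0, 3)), columns=['top1', 'top2', 'top3'])
for chunk in data:
    # Transpose so that each original row becomes a column
    x = chunk.T
    for i in x.columns:
        # The index labels of the 3 largest entries in column i are the original column names
        df1row = pd.DataFrame(x.nlargest(3, i).index.tolist(), index=['top1', 'top2', 'top3']).T
        rslt = pd.concat([rslt, df1row], axis=0)
rslt = rslt.reset_index(drop=True)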
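
A possible variation on the chunked loop above, offered only as a sketch: it collects one small frame per chunk and concatenates once at the end, and it ranks each row with NumPy instead of calling nlargest column by column. The in-memory csv_text is a stand-in for the file on disk (an assumption for the sake of a runnable example), and the chunk size of 2 is chosen only to force several chunks on this tiny sample.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the CSV file on disk (assumption: same values as the example data).
csv_text = """a1,a2,a3,a4,a5,a6
5,23,4,1,4,5
6,43,2,2,98,43
3,56,3,1,23,3
2,2,6,3,5,2
5,6,7,2,7,5
"""

labels = ["top1", "top2", "top3"]
parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Negating the values turns an ascending sort into a descending one;
    # the stable sort keeps tied columns in left-to-right order.
    order = np.argsort(-chunk.to_numpy(), axis=1, kind="stable")[:, :3]
    names = chunk.columns.to_numpy()[order]
    parts.append(pd.DataFrame(names, index=chunk.index, columns=labels))

# Concatenate once instead of growing the result inside the loop.
rslt = pd.concat(parts)
print(rslt)
```

Because read_csv keeps a running row index across chunks, the combined result here is indexed 0 to 4 without a reset_index call.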
CWeeks asked May 28 '16 03:05


3 Answers

import pandas as pd
import numpy as np

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}
df=pd.DataFrame(df)

df


   a1  a2  a3  a4  a5  a6
0   5  23   4   1   4   5
1   6  43   2   2  98  43
2   3  56   3   1  23   3
3   2   2   6   3   5   2
4   5   6   7   2   7   5

We can solve it using argsort from NumPy together with apply and a lambda from pandas. The solution:

Tops =pd.DataFrame(df.apply(lambda x:list(df.columns[np.array(x).argsort()[::-1][:3]]), axis=1).to_list(),  columns=['Top1', 'Top2', 'Top3'])


Tops

And we get:

  Top1 Top2 Top3
0   a2   a6   a1
1   a5   a6   a2
2   a2   a5   a6
3   a3   a5   a4
4   a5   a3   a2
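
The same argsort idea can also be applied to the whole array at once rather than row by row through apply. The following is a minimal sketch, not a definitive implementation: top_n_columns is a hypothetical helper name, and it assumes all columns are numeric. It sorts the negated values with a stable sort, so tied columns come out in left-to-right order, which differs from the reversed-argsort output above (for row 0 it gives a1 before a6).

```python
import numpy as np
import pandas as pd


def top_n_columns(frame, n=3):
    """Return, for each row, the names of the n columns holding the largest values."""
    values = frame.to_numpy()
    # Ascending stable sort of the negated values = descending sort of the originals,
    # with ties resolved in left-to-right column order.
    order = np.argsort(-values, axis=1, kind="stable")[:, :n]
    names = frame.columns.to_numpy()[order]
    return pd.DataFrame(names, index=frame.index,
                        columns=[f"top{i + 1}" for i in range(n)])


df = pd.DataFrame({'a1': [5, 6, 3, 2, 5], 'a2': [23, 43, 56, 2, 6],
                   'a3': [4, 2, 3, 6, 7], 'a4': [1, 2, 1, 3, 2],
                   'a5': [4, 98, 23, 5, 7], 'a6': [5, 43, 3, 2, 5]})

tops = top_n_columns(df, n=3)
print(tops)
```

The n parameter generalizes the question's "top 3" to any top n, up to the number of columns.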
George Pipis answered Oct 26 '22 20:10


What you need is pandas.DataFrame.nlargest.

import pandas as pd
import numpy as np

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}

x=pd.DataFrame(df).T

rslt = pd.DataFrame(np.zeros((0,3)), columns=['top1','top2','top3'])
for i in x.columns:
    df1row = pd.DataFrame(x.nlargest(3, i).index.tolist(), index=['top1','top2','top3']).T
    rslt = pd.concat([rslt, df1row], axis=0)

print(rslt)

Out[52]: 
  top1 top2 top3
0   a2   a1   a6
0   a5   a2   a6
0   a2   a5   a1
0   a3   a5   a4
0   a3   a5   a2
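
As a sketch of a more compact form of the same nlargest approach, Series.nlargest can be applied to each row directly, which avoids the transpose and the repeated concat and keeps the original row index instead of all zeros. The labels list is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a1': [5, 6, 3, 2, 5], 'a2': [23, 43, 56, 2, 6],
                   'a3': [4, 2, 3, 6, 7], 'a4': [1, 2, 1, 3, 2],
                   'a5': [4, 98, 23, 5, 7], 'a6': [5, 43, 3, 2, 5]})

labels = ['top1', 'top2', 'top3']

# For each row, take the 3 largest values and keep only their column names.
# Series.nlargest keeps the first occurrence on ties (keep='first' is the default).
rslt = df.apply(lambda row: pd.Series(row.nlargest(3).index, index=labels), axis=1)
print(rslt)
```

This still runs a Python-level function once per row, so it is likely to be slow on long data.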
2342G456DI8 answered Oct 26 '22 19:10


You can do it like this:

x.T.apply(lambda x: x.sort_values(ascending=False).index).T.filter(['a1','a2','a3']).rename(columns={"a1":'top1',"a2":'top2',"a3":'top3'})

Results:

  top1 top2 top3
0   a2   a6   a1
1   a5   a6   a2
2   a2   a5   a6
3   a3   a5   a4
4   a5   a3   a2
Billy Bonaros answered Oct 26 '22 19:10