
How to keep only the top N values in a dataframe

I have a pandas DataFrame. For every row, I want to keep only the top N (N=3) values and set the others to NaN:

import pandas as pd
import numpy as np

data = np.array([['','day1','day2','day3','day4','day5'],
                ['larry',1,4,4,3,5],
                ['gunnar',2,-1,3,4,4],
                ['tin',-2,5,5, 6,7]])
                
# np.array stores the mixed list above as strings, so cast the values to int
df = pd.DataFrame(data=data[1:,1:].astype(int),
                  index=data[1:,0],
                  columns=data[0,1:])
print(df)

The output is:

       day1 day2 day3 day4 day5
larry     1    4    4    3    5
gunnar    2   -1    3    4    4
tin      -2    5    5    6    7

I want to get

       day1 day2 day3 day4 day5
larry   NaN    4    4  NaN    5
gunnar  NaN  NaN    3    4    4
tin     NaN    5  NaN    6    7

This is similar to "pandas: Keep only top n values and set others to 0", but I need to keep only the N highest available values and leave the rest as NaN; otherwise the average is not correct.
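To illustrate why NaN rather than 0 matters for the average, here is a minimal sketch using larry's row from the desired result. The variable names are made up for the illustration; the only behaviour relied on is that `Series.mean()` skips NaN by default.

```python
import numpy as np
import pandas as pd

# larry's row with only the top 3 values kept, the rest set to NaN
kept_as_nan = pd.Series([np.nan, 4, 4, np.nan, 5])
# the same row with the dropped values set to 0 instead
kept_as_zero = kept_as_nan.fillna(0)

# mean() skips NaN by default (skipna=True): (4 + 4 + 5) / 3
mean_nan = kept_as_nan.mean()
# zeros count as real values: (4 + 4 + 5) / 5
mean_zero = kept_as_zero.mean()

print(mean_nan, mean_zero)
```

With zeros the dropped days still count in the denominator, which pulls the average down.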

For the result above, I want to keep only the first 5.

Larry Cai asked Feb 04 '21 15:02


1 Answer

You can use np.unique to sort the values and find the 5th largest distinct value across the whole frame, then use where to keep everything at or above it:

uniques = np.unique(df)

# what happens if len(uniques) < 5?
thresh = uniques[-5]
df.where(df >= thresh)

Output:

        day1  day2  day3  day4  day5
larry    NaN   4.0     4     3     5
gunnar   NaN   NaN     3     4     4
tin      NaN   5.0     5     6     7
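On the question raised in the code comment: `uniques[-5]` raises an IndexError when the frame holds fewer than 5 distinct values. One possible guard, as a rough sketch, is to clamp the index. The helper name `keep_top_distinct` is hypothetical, and it assumes `n >= 1` and numeric data without NaN.

```python
import numpy as np
import pandas as pd

def keep_top_distinct(df, n):
    """Keep values >= the n-th largest distinct value in the whole frame.

    Hypothetical helper; assumes n >= 1.
    """
    uniques = np.unique(df.to_numpy())  # sorted ascending, duplicates removed
    if len(uniques) == 0:
        return df.copy()
    # Clamp n so the index stays in bounds when there are fewer than n
    # distinct values; in that case every value is kept.
    thresh = uniques[-min(n, len(uniques))]
    return df.where(df >= thresh)

df = pd.DataFrame(
    [[1, 4, 4, 3, 5], [2, -1, 3, 4, 4], [-2, 5, 5, 6, 7]],
    index=["larry", "gunnar", "tin"],
    columns=["day1", "day2", "day3", "day4", "day5"],
)

top5 = keep_top_distinct(df, 5)    # threshold is 3 for this data
top20 = keep_top_distinct(df, 20)  # fewer than 20 distinct values
print(top5)
```

Whether keeping everything is the right fallback depends on the use case; raising a clearer error would be an equally reasonable choice.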

Update: On second look, I think you can do:

df.apply(pd.Series.nlargest, n=3, axis=1).reindex(df.columns, axis=1)

Output:

        day1  day2  day3  day4  day5
larry    NaN   4.0   4.0   NaN   5.0
gunnar   NaN   NaN   3.0   4.0   4.0
tin      NaN   5.0   NaN   6.0   7.0
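As a possible alternative to the row-wise apply, the same per-row selection can be sketched with `DataFrame.rank`. This is not part of the original answer; it assumes that exactly `n` values per row should survive and that ties may be broken by position, which is what `method="first"` is documented to do.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 4, 4, 3, 5], [2, -1, 3, 4, 4], [-2, 5, 5, 6, 7]],
    index=["larry", "gunnar", "tin"],
    columns=["day1", "day2", "day3", "day4", "day5"],
)

n = 3
# Rank each row from largest (rank 1) to smallest; method="first" gives
# tied values distinct ranks, so exactly n ranks are <= n in every row.
ranks = df.rank(axis=1, method="first", ascending=False)
# Keep the values ranked in the top n; everything else becomes NaN.
result = df.where(ranks <= n)
print(result)
```

Because `where` preserves the original shape, no `reindex` is needed to bring back columns such as day1 that never make the top n.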
Quang Hoang answered Sep 18 '22 16:09