I have a pandas DataFrame. For every row, I want to keep only the top N (N=3) values and set the others to NaN.
import pandas as pd
import numpy as np
data = np.array([['', 'day1', 'day2', 'day3', 'day4', 'day5'],
                 ['larry', 1, 4, 4, 3, 5],
                 ['gunnar', 2, -1, 3, 4, 4],
                 ['tin', -2, 5, 5, 6, 7]])
df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:])
print(df)
The output is:
        day1 day2 day3 day4 day5
larry      1    4    4    3    5
gunnar     2   -1    3    4    4
tin       -2    5    5    6    7
I want to get:
        day1 day2 day3 day4 day5
larry    NaN    4    4  NaN    5
gunnar   NaN  NaN    3    4    4
tin      NaN    5  NaN    6    7
This is similar to "pandas: Keep only top n values and set others to 0", but I need to keep only the N highest available values and set the rest to NaN rather than 0; otherwise the average is not correct.
For the result above I want to keep the first 5 only.
You can use np.unique to sort the values and find the 5th largest one, then use where:
uniques = np.unique(df)
# what happens if len(uniques) < 5?
thresh = uniques[-5]
df.where(df >= thresh)
Output:
        day1 day2 day3 day4 day5
larry    NaN  4.0    4    3    5
gunnar   NaN  NaN    3    4    4
tin      NaN  5.0    5    6    7
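The comment in the snippet above asks what happens when there are fewer than 5 distinct values: indexing with uniques[-5] would then raise an IndexError. Below is a minimal sketch of one way to guard against that, assuming the frame holds numeric values. The helper name keep_top_unique and the fall-back of returning the frame unchanged are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd


def keep_top_unique(df, k):
    """Keep cells >= the k-th largest distinct value in the whole frame.

    Hypothetical helper: if the frame has fewer than k distinct values,
    every cell already qualifies, so an unchanged copy is returned.
    """
    uniques = np.unique(df.to_numpy())  # flattened and sorted ascending
    if len(uniques) < k:
        return df.copy()
    thresh = uniques[-k]
    return df.where(df >= thresh)  # cells below the threshold become NaN


# Numeric version of the frame from the question (assumed conversion).
df = pd.DataFrame(
    [[1, 4, 4, 3, 5], [2, -1, 3, 4, 4], [-2, 5, 5, 6, 7]],
    index=["larry", "gunnar", "tin"],
    columns=["day1", "day2", "day3", "day4", "day5"],
)

result = keep_top_unique(df, 5)
print(result)
```

Note that this threshold is global (computed over the whole frame), so the number of values kept per row can differ from row to row.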
Update: On second look, I think you can do:
df.apply(pd.Series.nlargest, n=3, axis=1).reindex(df.columns, axis=1)
Output:
        day1 day2 day3 day4 day5
larry    NaN  4.0  4.0  NaN  5.0
gunnar   NaN  NaN  3.0  4.0  4.0
tin      NaN  5.0  NaN  6.0  7.0
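One caveat: nlargest needs numeric data, while the array in the question mixes names and numbers, so NumPy stores every cell as a string. Here is a self-contained sketch of the row-wise approach, assuming a pd.to_numeric conversion step that the original post does not show. The names N, top_n and row_means are only for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([['', 'day1', 'day2', 'day3', 'day4', 'day5'],
                 ['larry', 1, 4, 4, 3, 5],
                 ['gunnar', 2, -1, 3, 4, 4],
                 ['tin', -2, 5, 5, 6, 7]])
df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:])

# Assumed extra step: the mixed array holds strings, so convert each column.
df = df.apply(pd.to_numeric)

N = 3
# Take the N largest values of each row, then restore the original column
# order (columns that never make the top N come back as all-NaN).
top_n = df.apply(pd.Series.nlargest, n=N, axis=1).reindex(df.columns, axis=1)

# mean() skips NaN by default, so this averages only the kept values.
row_means = top_n.mean(axis=1)

print(top_n)
print(row_means)
```

Each row keeps exactly N values here, even when there are ties, because nlargest keeps the first occurrences by default.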