I have a DataFrame:
>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
         a     b     c     d     e
row1   1.0   2.0   NaN   4.0   5.0
row2  11.0  12.0  13.0  14.0   NaN
row3  22.0  22.0  23.0  24.0  25.0
and a Series that specifies the number of top N values I want from each row
>>> n_max = pd.Series([2,3,4])
What is Panda's way of using df and n_max to find the largest N elements of each (breaking ties with a random pick, just as .nlargest() would do)?
The desired output is
         a     b     c     d     e
row1   NaN   NaN   NaN   4.0   5.0
row2   NaN  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0
I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:
>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
         a     b     c     d     e
row1   1.0   2.0   NaN   4.0   5.0
row2  11.0  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0
But the goal, again, is to have row-specific N. Looping through each row obviously doesn't count (for performance reasons). And I've tried using .rank() with a mask but tie breaking doesn't work there...
Based on @ScottBoston's comment on the OP, it is possible to use the following mask based on rank to solve this problem:
>>> n_max.index = df.index
>>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
>>> selected = df_rank.le(n_max, axis=0)
>>> df[selected]
         a     b     c     d     e
row1   NaN   NaN   NaN   4.0   5.0
row2   NaN  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With