I have a large pandas DataFrame along these lines:
| ID | Var1 | Var2 | Var3 | Var4 | Var5 |
|----|------|------|------|------|------|
| 1  | 1    | 2    | 3    | 4    | 5    |
| 2  | 10   | 9    | 8    | 7    | 6    |
| 3  | 25   | 37   | 41   | 24   | 21   |
| 4  | 102  | 11   | 72   | 56   | 151  |
...
and I would like to generate output that looks like this, taking the column names of the 3 highest values for each row:
| ID | 1st Max | 2nd Max | 3rd Max |
|----|---------|---------|---------|
| 1  | Var5    | Var4    | Var3    |
| 2  | Var1    | Var2    | Var3    |
| 3  | Var3    | Var2    | Var1    |
| 4  | Var5    | Var1    | Var3    |
...
I have tried using df.idxmax(axis=1), which returns the column name of the first maximum, but I am unsure how to compute the other two.
Any help on this would be truly appreciated, thanks!
Use numpy.argsort to get the column positions of the sorted values, select the top 3 by indexing, and finally pass the result to the DataFrame constructor:
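import numpy as np
import pandas as pd

df = df.set_index('ID')
# argsort on the negated values gives column positions in descending order; keep the first 3
df = pd.DataFrame(df.columns.values[np.argsort(-df.values, axis=1)[:, :3]],
                  index=df.index,
                  columns=['1st Max', '2nd Max', '3rd Max']).reset_index()
print(df)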
   ID 1st Max 2nd Max 3rd Max
0   1    Var5    Var4    Var3
1   2    Var1    Var2    Var3
2   3    Var3    Var2    Var1
3   4    Var5    Var1    Var3
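
To see what the one-liner does, the steps can be unpacked on the sample data from the question. This is only a minimal sketch: the intermediate names (`wide`, `values`, `order`, `top3`, `names`, `result`) are introduced here for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Var1": [1, 10, 25, 102],
    "Var2": [2, 9, 37, 11],
    "Var3": [3, 8, 41, 72],
    "Var4": [4, 7, 24, 56],
    "Var5": [5, 6, 21, 151],
})

wide = df.set_index("ID")
values = wide.to_numpy()

# Negating the values makes argsort return column positions
# ordered from the largest value to the smallest in each row.
order = np.argsort(-values, axis=1)

# Keep only the first three positions per row.
top3 = order[:, :3]

# Indexing the array of column labels with the position array
# turns positions into column names, row by row.
names = wide.columns.to_numpy()[top3]

result = pd.DataFrame(
    names,
    index=wide.index,
    columns=["1st Max", "2nd Max", "3rd Max"],
).reset_index()
print(result)
```

If a row can contain equal values, the order among them depends on the sort algorithm; passing `kind="stable"` to `np.argsort` is one way to make the result deterministic, with tied values kept in their left-to-right column order.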
Or, if performance is not important, use nlargest with apply on each row:
# start from the original df (with the ID and Var columns)
c = ['1st Max', '2nd Max', '3rd Max']
df = (df.set_index('ID')
        .apply(lambda x: pd.Series(x.nlargest(3).index, index=c), axis=1)
        .reset_index())
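
For completeness, here is a self-contained sketch of the row-wise variant on the sample data from the question. It assumes every row has at least three non-missing values, since otherwise the labels in `c` would not match the length of the `nlargest` result; the name `result` is introduced here for illustration only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Var1": [1, 10, 25, 102],
    "Var2": [2, 9, 37, 11],
    "Var3": [3, 8, 41, 72],
    "Var4": [4, 7, 24, 56],
    "Var5": [5, 6, 21, 151],
})

c = ["1st Max", "2nd Max", "3rd Max"]

# For each row, nlargest(3) returns the three largest values;
# their index holds the corresponding column names.
result = (
    df.set_index("ID")
      .apply(lambda row: pd.Series(row.nlargest(3).index, index=c), axis=1)
      .reset_index()
)
print(result)
```

This version calls a Python function once per row, which is why it tends to be slower than the argsort approach on large frames.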