
Pandas: Idxmax, best n results

Tags: python, pandas

I'm doing principal component analysis and get a components result like the following:

In [140]: components.head()
Out[140]: 
        V52      V134      V195      V205       V82      V189       V10  \
0  0.070309  0.043759 -0.102138  0.442536 -0.010881  0.041344 -0.001451   
1  0.004664  0.313388 -0.140883  0.015051  0.023085  0.055634  0.065873   
2  0.028201 -0.116513 -0.135300 -0.092226 -0.009306  0.065079 -0.030595   
3  0.018049 -0.136013  0.073010 -0.076940  0.013245 -0.010582  0.065641   

        V47      V177      V184    ...         V208        V5      V133  \
0  0.066203  0.016056  0.105487    ...    -0.144894 -0.009810  0.117964   
1 -0.009324  0.008935 -0.044760    ...    -0.014553 -0.014208  0.200632   
2  0.013799  0.169503 -0.010660    ...    -0.079821 -0.053905  0.080867   
3 -0.023983  0.111241 -0.058065    ...    -0.061059  0.023443 -0.080217   

       V182        V7      V165       V66      V110      V163      V111  
0  0.105744  0.021426 -0.024762  0.021677  0.022448 -0.055235  0.031443  
1 -0.013170  0.050605  0.039877 -0.009789  0.031876  0.030285  0.021022  
2  0.046810 -0.046136  0.029483 -0.009503  0.027325  0.029591  0.028920  
3 -0.019632  0.023725 -0.038712  0.024930  0.063177 -0.057635  0.067163 

Now, for each component (row), I would like to get the n columns with the largest absolute values. I can do the following when n == 1:

In [143]: components.abs().idxmax(axis=1)
Out[143]: 
0    V205
1     V98
2    V137
3     V23
dtype: object

But what can I do for n > 1?

asked Mar 08 '16 15:03 by FooBar



1 Answer

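You can use the nlargest method. Note that DataFrame.nlargest returns the n rows with the largest values, ordered by the given columns:

n = 5
cols = df.columns
df.nlargest(n, cols)

For example, with a small random DataFrame:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))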

>>> df
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

>>> df.nlargest(3, df.columns)
          A         B         C
1  2.240893  1.867558 -0.977278
0  1.764052  0.400157  0.978738
2  0.950088 -0.151357 -0.103219
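
As a rough sketch of what the call above does, assuming the same seeded df: with several columns, nlargest orders the rows by the first column and uses the later columns only to break ties, so it selects rows rather than columns. The names top_rows and same_rows are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))

# nlargest keeps the 3 rows that come first when sorting by A, then B, then C (descending)
top_rows = df.nlargest(3, ['A', 'B', 'C'])

# The same selection written as an explicit sort followed by head()
same_rows = df.sort_values(['A', 'B', 'C'], ascending=False).head(3)

print(top_rows.index.tolist())   # [1, 0, 2]
print(same_rows.index.tolist())  # [1, 0, 2]
```

Because the result is a subset of rows, this on its own does not give the per-row column names the question asks for.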

To get the top two columns, ranked by each column's largest absolute value:

>>> n = 2
>>> df.abs().max().nlargest(n)
A    2.240893
B    1.867558
dtype: float64

To get, for each row, the names of the two columns with the largest absolute values (the n > 1 case from the question, here with n = 2):

>>> df.apply(lambda s: s.abs().nlargest(2).index.tolist(), axis=1)
0    [A, C]
1    [A, B]
2    [A, B]
3    [C, A]
4    [A, C]
dtype: object
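
For a wide components table, a vectorized variant may be faster than a row-wise apply. The following is only a sketch under the assumption that all columns are numeric; top_n_columns and the top1, top2, ... column labels are hypothetical names introduced here, not part of pandas or of the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))

def top_n_columns(frame, n):
    """Hypothetical helper: for each row, the names of the n columns with the largest |value|."""
    # Sorting the negated absolute values puts the largest magnitudes first in each row
    order = np.argsort(-frame.abs().to_numpy(), axis=1)[:, :n]
    # Map the positional indices back to column labels
    names = frame.columns.to_numpy()[order]
    labels = [f"top{i + 1}" for i in range(n)]
    return pd.DataFrame(names, index=frame.index, columns=labels)

top2 = top_n_columns(df, 2)
print(top2)
```

On the example df this should give the same column names as the apply version, one column per rank instead of one list per row.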
answered Nov 03 '22 09:11 by Alexander
