Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

return n smallest indexes by column using pandas

Tags:

python

pandas

I have the following (simplified) dataframe:

df = pd.DataFrame({'X': [1, 2, 3, 4, 5,6,7,8,9,10],
'Y': [10,20,30,40,50,-10,-20,-30,-40,-50],
'Z': [20,18,16,14,12,10,8,6,4,2]},index=list('ABCDEFGHIJ'))

Which gives the following:

    X   Y   Z
A   1  10  20
B   2  20  18
C   3  30  16
D   4  40  14
E   5  50  12
F   6 -10  10
G   7 -20   8
H   8 -30   6
I   9 -40   4
J  10 -50   2

I want to create a new dataframe that returns the index of the n smallest values, by column.

Desired output (say, 3 smallest values):

   X  Y  Z
0  A  J  J
1  B  I  I
2  C  H  H

What is the best way to do this?

like image 679
Nick Morgan Avatar asked Jul 19 '17 14:07

Nick Morgan


2 Answers

You can use apply with nsmallest:

n = 3
df.apply(lambda x: pd.Series(x.nsmallest(n).index))

#   X   Y   Z
#0  A   J   J
#1  B   I   I
#2  C   H   H
like image 184
Psidom Avatar answered Sep 22 '22 06:09

Psidom


Faster numpy solution with numpy.argsort:

N = 3
a = np.argsort(-df.values, axis=0)[-1:-1-N:-1]
print (a)
[[0 9 9]
 [1 8 8]
 [2 7 7]]

b = pd.DataFrame(df.index[a], columns=df.columns)
print (b)
   X  Y  Z
0  A  J  J
1  B  I  I
2  C  H  H

Timings:

In [111]: %timeit (pd.DataFrame(df.index[np.argsort(-df.values, axis=0)[-1:-1-N:-1]], columns=df.columns))
159 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [112]: %timeit (df.apply(lambda x: pd.Series(x.nsmallest(N).index)))
3.52 ms ± 49.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
like image 25
jezrael Avatar answered Sep 25 '22 06:09

jezrael