Getting the three smallest values per row and returning the corresponding column names

I have two DataFrames, df and df2, with the same index and columns. Based on the first DataFrame, df, I want to find the 3 smallest values in each row and return the corresponding column names (in this case "X", "Y", "Z" or "T"). That gives me a new DataFrame, df3.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': [21, 2, 43, 44, 56, 67, 7, 38, 29, 130],
    'Y': [101, 220, 330, 140, 250, 10, 207, 320, 420, 50],
    'Z': [20, 128, 136, 144, 312, 10, 82, 63, 42, 12],
    'T': [2, 32, 4, 424, 256, 167, 27, 38, 229, 30]
}, index=list('ABCDEFGHIJ'))

df2 = pd.DataFrame({
    'X': [0.5, 0.12, 0.43, 0.424, 0.65, 0.867, 0.17, 0.938, 0.229, 0.113],
    'Y': [0.1, 2.201, 0.33, 0.140, 0.525, 0.31, 0.20, 0.32, 0.420, 0.650],
    'Z': [0.20, 0.128, 0.136, 0.2144, 0.5312, 0.61, 0.82, 0.363, 0.542, 0.512],
    'T': [0.52, 0.232, 0.34, 0.6424, 0.6256, 0.3167, 0.527, 0.38, 0.4229, 0.73]
}, index=list('ABCDEFGHIJ'))

Besides that, I want another DataFrame, df4, that holds the values from df2 at the positions named in df3. For example, in row 'A' of df the 3 smallest values are (2, 20, 21), so in row 'A' of df4 I want (0.52, 0.2, 0.5) from df2.

367

asked Sep 05 '17 05:09

Hong

2 Answers

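If both DataFrames have the same column names in the same order, you can use argsort to get the column positions of the three smallest values in each row. (The positions printed below assume the columns are stored as T, X, Y, Z, which is how older pandas versions ordered dict keys; with the order X, Y, Z, T the integers, and the order of tied values, will differ.)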

arr = df.values.argsort(1)[:,:3]
print (arr)
[[0 3 1]
 [1 0 3]
 [0 1 3]
 [1 2 3]
 [1 2 0]
 [2 3 1]
 [1 0 3]
 [0 1 3]
 [1 3 0]
 [3 0 2]]

# get the values of df2 at the positions in arr (the row index is broadcast against arr)
b = df2.values[np.arange(len(arr))[:,None], arr]
print (b)
[[ 0.52    0.2     0.5   ]
 [ 0.12    0.232   0.128 ]
 [ 0.34    0.43    0.136 ]
 [ 0.424   0.14    0.2144]
 [ 0.65    0.525   0.6256]
 [ 0.31    0.61    0.867 ]
 [ 0.17    0.527   0.82  ]
 [ 0.38    0.938   0.363 ]
 [ 0.229   0.542   0.4229]
 [ 0.512   0.73    0.65  ]]
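The indexing step above can also be written with np.take_along_axis, which may read more directly than the broadcast row index. The following is only a sketch on the first three rows of the question's data; the names rows, b_fancy and b_take are made up for illustration.

```python
import numpy as np
import pandas as pd

# First three rows (A, B, C) of the question's data
df = pd.DataFrame({'X': [21, 2, 43], 'Y': [101, 220, 330],
                   'Z': [20, 128, 136], 'T': [2, 32, 4]}, index=list('ABC'))
df2 = pd.DataFrame({'X': [0.5, 0.12, 0.43], 'Y': [0.1, 2.201, 0.33],
                    'Z': [0.20, 0.128, 0.136], 'T': [0.52, 0.232, 0.34]},
                   index=list('ABC'))

# Column positions of the 3 smallest values in each row of df
arr = df.to_numpy().argsort(axis=1)[:, :3]

# Explicit fancy indexing: a (n_rows, 1) row index broadcast against arr
rows = np.arange(len(arr))[:, None]
b_fancy = df2.to_numpy()[rows, arr]

# Equivalent gather along axis 1
b_take = np.take_along_axis(df2.to_numpy(), arr, axis=1)

print(np.array_equal(b_fancy, b_take))  # True
```

Both forms pick, for every row i, the entries df2.values[i, arr[i, j]].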

Finally, build the two DataFrames with the DataFrame constructor:

# index the underlying array; newer pandas versions reject 2-D indexing on an Index
df3 = pd.DataFrame(df.columns.values[arr])
df3.columns = ['Col{}'.format(x+1) for x in df3.columns]
print (df3)
  Col1 Col2 Col3
0    T    Z    X
1    X    T    Z
2    T    X    Z
3    X    Y    Z
4    X    Y    T
5    Y    Z    X
6    X    T    Z
7    T    X    Z
8    X    Z    T
9    Z    T    Y

df4 = pd.DataFrame(b)
df4.columns = ['Col{}'.format(x+1) for x in df4.columns]
print (df4)
    Col1   Col2    Col3
0  0.520  0.200  0.5000
1  0.120  0.232  0.1280
2  0.340  0.430  0.1360
3  0.424  0.140  0.2144
4  0.650  0.525  0.6256
5  0.310  0.610  0.8670
6  0.170  0.527  0.8200
7  0.380  0.938  0.3630
8  0.229  0.542  0.4229
9  0.512  0.730  0.6500
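If you want df3 and df4 to keep the original row labels (A, B, ...) and do not want to rely on df2 having its columns in the same order, the steps can be wrapped in a small function. This is a sketch, not part of the answer above: k_smallest is a hypothetical helper name, the reindex call is an added assumption to align the two frames, and the example uses only the first three rows of the question's data.

```python
import numpy as np
import pandas as pd


def k_smallest(values_df, lookup_df, k=3):
    """Hypothetical helper: names of the k smallest columns per row of
    values_df, plus the matching entries of lookup_df."""
    # Align lookup_df so that integer positions mean the same in both frames
    lookup_df = lookup_df.reindex(index=values_df.index,
                                  columns=values_df.columns)
    # Stable sort keeps tied values in column order
    pos = np.argsort(values_df.to_numpy(), axis=1, kind='stable')[:, :k]
    cols = ['Col{}'.format(i + 1) for i in range(k)]
    names = pd.DataFrame(values_df.columns.to_numpy()[pos],
                         index=values_df.index, columns=cols)
    values = pd.DataFrame(np.take_along_axis(lookup_df.to_numpy(), pos, axis=1),
                          index=values_df.index, columns=cols)
    return names, values


df = pd.DataFrame({'X': [21, 2, 43], 'Y': [101, 220, 330],
                   'Z': [20, 128, 136], 'T': [2, 32, 4]}, index=list('ABC'))
df2 = pd.DataFrame({'X': [0.5, 0.12, 0.43], 'Y': [0.1, 2.201, 0.33],
                    'Z': [0.20, 0.128, 0.136], 'T': [0.52, 0.232, 0.34]},
                   index=list('ABC'))

df3, df4 = k_smallest(df, df2)
print(df3)
print(df4)
```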

The answers are similar, so I timed them:

np.random.seed(14)
N = 1000000
df1 = pd.DataFrame(np.random.randint(100, size=(N, 4)), columns=['X','Y','Z','T'])
#print (df1)

df1 = pd.DataFrame(np.random.rand(N, 4), columns=['X','Y','Z','T'])
#print (df1)


def jez():
    arr = df.values.argsort(1)[:,:3]
    b = df2.values[np.arange(len(arr))[:,None], arr]
    df3 = pd.DataFrame(df.columns.values[arr])
    df3.columns = ['Col{}'.format(x+1) for x in df3.columns]
    df4 = pd.DataFrame(b)
    df4.columns = ['Col{}'.format(x+1) for x in df4.columns]


def pir():
    v = df.values
    a = v.argpartition(3, 1)[:, :3]
    c = df.columns.values[a]
    pd.DataFrame(c, df.index)
    d = df2.values[np.arange(len(df))[:, None], a]
    pd.DataFrame(d, df.index, [1, 2, 3]).add_prefix('Col')

def cᴏʟᴅsᴘᴇᴇᴅ():
    #another solution is wrong
    df3 = df.apply(lambda x: df.columns[np.argsort(x)], 1).iloc[:, :3]
    pd.DataFrame({'Col{}'.format(i + 1) : df2.lookup(df3.index, df3.iloc[:, i]) for i in range(df3.shape[1])}, index=df.index)


print (jez())
print (pir())
print (cᴏʟᴅsᴘᴇᴇᴅ())

In [176]: %timeit (jez())
1000 loops, best of 3: 412 µs per loop

In [177]: %timeit (pir())
1000 loops, best of 3: 425 µs per loop

In [178]: %timeit (cᴏʟᴅsᴘᴇᴇᴅ())
100 loops, best of 3: 3.99 ms per loop
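The functions above read the global df and df2, so the timings reflect whatever those names hold. A sketch of a harness that passes the frames explicitly, so a larger random frame can be timed, might look like this. The names big, big2, with_argsort and with_argpartition are made up here, N is an arbitrary size, and no timings are quoted because they depend on the machine. Note that it uses kth = k - 1 for argpartition, which also leaves the k smallest values in the first k positions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(14)
N = 100_000  # arbitrary size chosen for this sketch
cols = ['X', 'Y', 'Z', 'T']
big = pd.DataFrame(rng.integers(100, size=(N, 4)), columns=cols)
big2 = pd.DataFrame(rng.random((N, 4)), columns=cols)


def with_argsort(a, b, k=3):
    # Full sort of every row, then keep the first k positions
    pos = a.to_numpy().argsort(axis=1)[:, :k]
    return a.columns.to_numpy()[pos], np.take_along_axis(b.to_numpy(), pos, axis=1)


def with_argpartition(a, b, k=3):
    # Partial sort: the first k positions hold the k smallest, in any order
    pos = a.to_numpy().argpartition(k - 1, axis=1)[:, :k]
    return a.columns.to_numpy()[pos], np.take_along_axis(b.to_numpy(), pos, axis=1)


for fn in (with_argsort, with_argpartition):
    # Best of 3 single runs, in seconds
    seconds = min(timeit.repeat(lambda: fn(big, big2), number=1, repeat=3))
    print('{}: {:.4f} s'.format(fn.__name__, seconds))
```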

163

answered Oct 01 '22 15:10

jezrael

I'd use numpy.argpartition, as it only partitions each row into the bottom k and the rest. Its time complexity is O(n) rather than O(n log n) because it does not need to sort completely. Note that the k selected columns are not returned in sorted order.

v = df.values
m = v.shape[1]

a = v.argpartition(3, 1)[:, :3]

c = df.columns.values[a]

We can define df3 based on this.

df3 = pd.DataFrame(c, df.index)

df3

   0  1  2
A  T  Z  X
B  X  T  Z
C  T  X  Z
D  Y  X  Z
E  Y  X  T
F  Y  Z  X
G  X  T  Z
H  X  T  Z
I  X  Z  T
J  Z  T  Y

You can use the same indices to create df4:

d = df2.values[np.arange(len(df))[:, None], a]
df4 = pd.DataFrame(d, df.index, [1, 2, 3]).add_prefix('Col')
df4

    Col1   Col2    Col3
A  0.520  0.200  0.5000
B  0.120  0.232  0.1280
C  0.340  0.430  0.1360
D  0.140  0.424  0.2144
E  0.525  0.650  0.6256
F  0.310  0.610  0.8670
G  0.170  0.527  0.8200
H  0.938  0.380  0.3630
I  0.229  0.542  0.4229
J  0.512  0.730  0.6500
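As rows D, E and H show, argpartition does not order the three columns it selects. If the order matters, one option is to sort only those k entries afterwards. This is a sketch on the first four rows of the question's data; the names part, order and pos are made up for illustration.

```python
import numpy as np
import pandas as pd

# First four rows (A to D) of the question's data
df = pd.DataFrame({'X': [21, 2, 43, 44], 'Y': [101, 220, 330, 140],
                   'Z': [20, 128, 136, 144], 'T': [2, 32, 4, 424]},
                  index=list('ABCD'))

v = df.to_numpy()
k = 3

# Positions of the k smallest values per row, in arbitrary order
part = np.argpartition(v, k - 1, axis=1)[:, :k]

# Sort just those k positions by the values they point to
order = np.take_along_axis(v, part, axis=1).argsort(axis=1)
pos = np.take_along_axis(part, order, axis=1)

df3 = pd.DataFrame(df.columns.to_numpy()[pos], index=df.index)
print(df3)
```

This keeps the partial selection and only adds a sort over k elements per row.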

answered Oct 01 '22 15:10

piRSquared

Tags: python, indexing, pandas, dataframe
