
Getting the three smallest values per row and returning the corresponding column names

I have two DataFrames, df and df2, with the same index and columns. Based on the first DataFrame, df, I want to find the 3 smallest values in each row and return the corresponding column names (in this case "X", "Y", "Z" or "T"). The result should be a new DataFrame, df3.

import numpy as np
import pandas as pd

df = pd.DataFrame({
        'X': [21, 2, 43, 44, 56, 67, 7, 38, 29, 130],
        'Y': [101, 220, 330, 140, 250, 10, 207, 320, 420, 50],
        'Z': [20, 128, 136, 144, 312, 10, 82, 63, 42, 12],
        'T': [2, 32, 4, 424, 256, 167, 27, 38, 229, 30]
    }, index=list('ABCDEFGHIJ'))

df2 = pd.DataFrame({
        'X': [0.5, 0.12, 0.43, 0.424, 0.65, 0.867, 0.17, 0.938, 0.229, 0.113],
        'Y': [0.1, 2.201, 0.33, 0.140, 0.525, 0.31, 0.20, 0.32, 0.420, 0.650],
        'Z': [0.20, 0.128, 0.136, 0.2144, 0.5312, 0.61, 0.82, 0.363, 0.542, 0.512],
        'T': [0.52, 0.232, 0.34, 0.6424, 0.6256, 0.3167, 0.527, 0.38, 0.4229, 0.73]
    }, index=list('ABCDEFGHIJ'))

Besides that, I want another DataFrame, df4, that holds the values from df2 at the positions recorded in df3. For example, in row 'A' of df the 3 smallest values are (2, 20, 21), so in row 'A' of df4 I want (0.52, 0.2, 0.5), taken from df2.

Hong asked Sep 05 '17 05:09


2 Answers

If both DataFrames have the same column names in the same order, you can use argsort to get the positional indices of the 3 smallest values in each row:

arr = df.values.argsort(1)[:,:3]
print (arr)
[[0 3 1]
 [1 0 3]
 [0 1 3]
 [1 2 3]
 [1 2 0]
 [2 3 1]
 [1 0 3]
 [0 1 3]
 [1 3 0]
 [3 0 2]]

# get values from df2 at the positions stored in arr
b = df2.values[np.arange(len(arr))[:,None], arr]
print (b)
[[ 0.52    0.2     0.5   ]
 [ 0.12    0.232   0.128 ]
 [ 0.34    0.43    0.136 ]
 [ 0.424   0.14    0.2144]
 [ 0.65    0.525   0.6256]
 [ 0.31    0.61    0.867 ]
 [ 0.17    0.527   0.82  ]
 [ 0.38    0.938   0.363 ]
 [ 0.229   0.542   0.4229]
 [ 0.512   0.73    0.65  ]]

Finally, use the DataFrame constructor:

df3 = pd.DataFrame(df.columns[arr])
df3.columns = ['Col{}'.format(x+1) for x in df3.columns]
print (df3)
  Col1 Col2 Col3
0    T    Z    X
1    X    T    Z
2    T    X    Z
3    X    Y    Z
4    X    Y    T
5    Y    Z    X
6    X    T    Z
7    T    X    Z
8    X    Z    T
9    Z    T    Y

df4 = pd.DataFrame(b)
df4.columns = ['Col{}'.format(x+1) for x in df4.columns]
print (df4)
    Col1   Col2    Col3
0  0.520  0.200  0.5000
1  0.120  0.232  0.1280
2  0.340  0.430  0.1360
3  0.424  0.140  0.2144
4  0.650  0.525  0.6256
5  0.310  0.610  0.8670
6  0.170  0.527  0.8200
7  0.380  0.938  0.3630
8  0.229  0.542  0.4229
9  0.512  0.730  0.6500
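
The steps above can be wrapped into one helper that also keeps the original row labels (the constructor calls above fall back to a 0..9 index). This is only a sketch: the function name `k_smallest_by_argsort` and the parameter `k` are made up for illustration, and it assumes df2 contains the same row and column labels as df.

```python
import numpy as np
import pandas as pd


def k_smallest_by_argsort(df, df2, k=3):
    """Return (names, values) for the k smallest entries in each row of df.

    Hypothetical helper, not part of pandas.
    """
    # Align df2 to df so positional indices mean the same thing in both.
    df2 = df2.reindex(index=df.index, columns=df.columns)
    # A stable sort keeps tied values in left-to-right column order.
    order = np.argsort(df.to_numpy(), axis=1, kind="stable")[:, :k]
    labels = ["Col{}".format(i + 1) for i in range(k)]
    names = pd.DataFrame(df.columns.to_numpy()[order],
                         index=df.index, columns=labels)
    values = pd.DataFrame(np.take_along_axis(df2.to_numpy(), order, axis=1),
                          index=df.index, columns=labels)
    return names, values


# First two rows of the question's data.
df = pd.DataFrame({"X": [21, 2], "Y": [101, 220],
                   "Z": [20, 128], "T": [2, 32]}, index=list("AB"))
df2 = pd.DataFrame({"X": [0.5, 0.12], "Y": [0.1, 2.201],
                    "Z": [0.20, 0.128], "T": [0.52, 0.232]}, index=list("AB"))

names, values = k_smallest_by_argsort(df, df2)
print(names)
print(values)
```

For row 'A' this gives the names T, Z, X and the values 0.52, 0.2, 0.5, which is what the question asks for.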

The answers are similar, so I timed them:

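np.random.seed(14)
N = 1000000
# integer test data (overwritten by the float version below)
df1 = pd.DataFrame(np.random.randint(100, size=(N, 4)), columns=['X','Y','Z','T'])
#print (df1)

# float test data; note that the functions below read df and df2, not df1
df1 = pd.DataFrame(np.random.rand(N, 4), columns=['X','Y','Z','T'])
#print (df1)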


def jez():
    arr = df.values.argsort(1)[:,:3]
    b = df2.values[np.arange(len(arr))[:,None], arr]
    df3 = pd.DataFrame(df.columns[arr])
    df3.columns = ['Col{}'.format(x+1) for x in df3.columns]
    df4 = pd.DataFrame(b)
    df4.columns = ['Col{}'.format(x+1) for x in df4.columns]


def pir():
    v = df.values
    a = v.argpartition(3, 1)[:, :3]
    c = df.columns.values[a]
    pd.DataFrame(c, df.index)
    d = df2.values[np.arange(len(df))[:, None], a]
    pd.DataFrame(d, df.index, [1, 2, 3]).add_prefix('Col')

def cᴏʟᴅsᴘᴇᴇᴅ():
    #another solution is wrong
    # note: DataFrame.lookup was deprecated in pandas 1.2.0 and removed in 2.0.0
    df3 = df.apply(lambda x: df.columns[np.argsort(x)], 1).iloc[:, :3]
    pd.DataFrame({'Col{}'.format(i + 1) : df2.lookup(df3.index, df3.iloc[:, i]) for i in range(df3.shape[1])}, index=df.index)


print (jez())
print (pir())
print (cᴏʟᴅsᴘᴇᴇᴅ())

In [176]: %timeit (jez())
1000 loops, best of 3: 412 µs per loop

In [177]: %timeit (pir())
1000 loops, best of 3: 425 µs per loop

In [178]: %timeit (cᴏʟᴅsᴘᴇᴇᴅ())
100 loops, best of 3: 3.99 ms per loop
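
The three functions timed above read the global df and df2, so the numbers describe the 10-row example. To compare the two NumPy approaches on a larger frame, one option is to pass the frames in as arguments and use the standard timeit module. The sketch below is an assumption-laden illustration: the names `big`, `big2`, `by_argsort` and `by_argpartition` and the size `N` are invented here, and no timings are claimed.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(14)
N = 100_000  # hypothetical size; adjust to taste
big = pd.DataFrame(rng.random((N, 4)), columns=list("XYZT"))
big2 = pd.DataFrame(rng.random((N, 4)), columns=list("XYZT"))


def by_argsort(a, b, k=3):
    # Full sort of every row, then keep the first k positions.
    idx = a.to_numpy().argsort(axis=1)[:, :k]
    return idx, np.take_along_axis(b.to_numpy(), idx, axis=1)


def by_argpartition(a, b, k=3):
    # Partial sort: the first k positions hold the k smallest, in no set order.
    idx = a.to_numpy().argpartition(k - 1, axis=1)[:, :k]
    return idx, np.take_along_axis(b.to_numpy(), idx, axis=1)


for fn in (by_argsort, by_argpartition):
    # Best of three single runs, in seconds.
    seconds = min(timeit.repeat(lambda: fn(big, big2), number=1, repeat=3))
    print("{}: {:.4f} s".format(fn.__name__, seconds))
```

With only four columns per row, the difference between a full sort and a partition may be small; it should matter more for wide frames.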

jezrael answered Oct 01 '22 15:10


I'd use numpy.argpartition, as it only partitions each row into the bottom k and the rest. Its time complexity is O(n) rather than O(n log n) because it does not need to sort completely. Note that the k smallest entries are not guaranteed to come back in sorted order.

v = df.values
m = v.shape[1]

a = v.argpartition(3, 1)[:, :3]

c = df.columns.values[a]

We can define df3 based on this.

df3 = pd.DataFrame(c, df.index)

df3

   0  1  2
A  T  Z  X
B  X  T  Z
C  T  X  Z
D  Y  X  Z
E  Y  X  T
F  Y  Z  X
G  X  T  Z
H  X  T  Z
I  X  Z  T
J  Z  T  Y

You can use this to create df4:

d = df2.values[np.arange(len(df))[:, None], a]
df4 = pd.DataFrame(d, df.index, [1, 2, 3]).add_prefix('Col')
df4

    Col1   Col2    Col3
A  0.520  0.200  0.5000
B  0.120  0.232  0.1280
C  0.340  0.430  0.1360
D  0.140  0.424  0.2144
E  0.525  0.650  0.6256
F  0.310  0.610  0.8670
G  0.170  0.527  0.8200
H  0.938  0.380  0.3630
I  0.229  0.542  0.4229
J  0.512  0.730  0.6500
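
Because argpartition leaves the selected entries in arbitrary order (row D above comes out as Y, X, Z), a possible follow-up is to sort only the k selected entries afterwards. The sketch below is one way to do that, not the answer's own code: the function name `k_smallest_by_argpartition` is made up, it partitions at `k - 1` rather than `3`, and it assumes df2 shares df's row and column order.

```python
import numpy as np
import pandas as pd


def k_smallest_by_argpartition(df, df2, k=3):
    """Hypothetical helper: k smallest per row, ordered smallest first."""
    v = df.to_numpy()
    # Partition so the first k positions hold the k smallest values per row.
    part = np.argpartition(v, k - 1, axis=1)[:, :k]
    # Sort only those k entries, then map back to column positions.
    inner = np.argsort(np.take_along_axis(v, part, axis=1), axis=1, kind="stable")
    order = np.take_along_axis(part, inner, axis=1)
    labels = ["Col{}".format(i + 1) for i in range(k)]
    names = pd.DataFrame(df.columns.to_numpy()[order],
                         index=df.index, columns=labels)
    values = pd.DataFrame(np.take_along_axis(df2.to_numpy(), order, axis=1),
                          index=df.index, columns=labels)
    return names, values


# Rows D and E of the question's data.
df = pd.DataFrame({"X": [44, 56], "Y": [140, 250],
                   "Z": [144, 312], "T": [424, 256]}, index=list("DE"))
df2 = pd.DataFrame({"X": [0.424, 0.65], "Y": [0.140, 0.525],
                    "Z": [0.2144, 0.5312], "T": [0.6424, 0.6256]}, index=list("DE"))

names, values = k_smallest_by_argpartition(df, df2)
print(names)
print(values)
```

The extra sort touches only k values per row, so the partition still does most of the work on wide frames. How tied values are ordered is not specified by this sketch.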

piRSquared answered Oct 01 '22 15:10