I have two DataFrames, df and df2, whose rows and columns correspond to each other. Based on the first DataFrame, df, I want to find the 3 smallest values in each row and return the corresponding column names (in this case "X", "Y", "Z" or "T"). The result should be a new DataFrame, df3.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': [21, 2, 43, 44, 56, 67, 7, 38, 29, 130],
    'Y': [101, 220, 330, 140, 250, 10, 207, 320, 420, 50],
    'Z': [20, 128, 136, 144, 312, 10, 82, 63, 42, 12],
    'T': [2, 32, 4, 424, 256, 167, 27, 38, 229, 30]
}, index=list('ABCDEFGHIJ'))
df2 = pd.DataFrame({
    'X': [0.5, 0.12, 0.43, 0.424, 0.65, 0.867, 0.17, 0.938, 0.229, 0.113],
    'Y': [0.1, 2.201, 0.33, 0.140, 0.525, 0.31, 0.20, 0.32, 0.420, 0.650],
    'Z': [0.20, 0.128, 0.136, 0.2144, 0.5312, 0.61, 0.82, 0.363, 0.542, 0.512],
    'T': [0.52, 0.232, 0.34, 0.6424, 0.6256, 0.3167, 0.527, 0.38, 0.4229, 0.73]
}, index=list('ABCDEFGHIJ'))
I also want another DataFrame, df4, that holds the values from df2 at the positions given by df3. For example, in row 'A' of df the 3 smallest values are (2, 20, 21), so in row 'A' of df4 I want (0.52, 0.2, 0.5) from df2.
To find the minimum value of every row in a DataFrame, call min() on the DataFrame with axis=1. It returns a Series indexed by row label that holds the minimum of each row.
To find the minimum of every column without skipping missing values, call min() with skipna=False. Any numeric column that contains NaN then yields NaN.
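As a minimal sketch of the two min() calls described above, here is a small made-up frame (the values and the NaN are illustrative, not taken from the question). idxmin(axis=1) is included as well, since it returns the column label of each row minimum, which is the single-value version of what the question asks for.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the NaN is there to show the effect of skipna.
demo = pd.DataFrame({
    'X': [21, 2, 43],
    'Y': [101, 220, np.nan],
    'Z': [20, 128, 136],
}, index=list('ABC'))

# Minimum of each row; NaN is skipped by default.
row_min = demo.min(axis=1)

# Column label that holds each row's minimum.
row_min_col = demo.idxmin(axis=1)

# Minimum of each column without skipping NaN: column 'Y' becomes NaN.
col_min_strict = demo.min(skipna=False)

print(row_min)
print(row_min_col)
print(col_min_strict)
```

min() and idxmin() only give one value per row, so they do not cover the "3 smallest" case on their own.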
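If both DataFrames have the same column names in the same order, you can use argsort to get the indices. In the output below the columns are ordered T, X, Y, Z (older pandas versions sorted dict keys alphabetically), so index 0 refers to T: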
arr = df.values.argsort(1)[:,:3]
print (arr)
[[0 3 1]
[1 0 3]
[0 1 3]
[1 2 3]
[1 2 0]
[2 3 1]
[1 0 3]
[0 1 3]
[1 3 0]
[3 0 2]]
#get values by indices in arr
b = df2.values[np.arange(len(arr))[:,None], arr]
print (b)
[[ 0.52 0.2 0.5 ]
[ 0.12 0.232 0.128 ]
[ 0.34 0.43 0.136 ]
[ 0.424 0.14 0.2144]
[ 0.65 0.525 0.6256]
[ 0.31 0.61 0.867 ]
[ 0.17 0.527 0.82 ]
[ 0.38 0.938 0.363 ]
[ 0.229 0.542 0.4229]
[ 0.512 0.73 0.65 ]]
Finally, use the DataFrame constructor:
# .values avoids indexing the Index object itself with a 2-D array
df3 = pd.DataFrame(df.columns.values[arr])
df3.columns = ['Col{}'.format(x+1) for x in df3.columns]
print (df3)
Col1 Col2 Col3
0 T Z X
1 X T Z
2 T X Z
3 X Y Z
4 X Y T
5 Y Z X
6 X T Z
7 T X Z
8 X Z T
9 Z T Y
df4 = pd.DataFrame(b)
df4.columns = ['Col{}'.format(x+1) for x in df4.columns]
print (df4)
Col1 Col2 Col3
0 0.520 0.200 0.5000
1 0.120 0.232 0.1280
2 0.340 0.430 0.1360
3 0.424 0.140 0.2144
4 0.650 0.525 0.6256
5 0.310 0.610 0.8670
6 0.170 0.527 0.8200
7 0.380 0.938 0.3630
8 0.229 0.542 0.4229
9 0.512 0.730 0.6500
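The steps above could be wrapped into one function, sketched below. The name k_smallest and its parameters are hypothetical. Unlike the code above, this version keeps the original row index and first reindexes df2 to the rows and columns of df, so it does not rely on both frames sharing the same column order.

```python
import numpy as np
import pandas as pd

def k_smallest(df, df2, k=3):
    """Hypothetical helper: column names of the k smallest values per row
    of df, plus the values of df2 at the same positions."""
    # Align df2 to df so positional indexing refers to the same cells.
    df2 = df2.reindex(index=df.index, columns=df.columns)
    # Positions of the k smallest values in each row, in ascending order.
    arr = np.argsort(df.to_numpy(), axis=1, kind='stable')[:, :k]
    labels = ['Col{}'.format(i + 1) for i in range(k)]
    names = pd.DataFrame(df.columns.to_numpy()[arr],
                         index=df.index, columns=labels)
    rows = np.arange(len(df))[:, None]
    values = pd.DataFrame(df2.to_numpy()[rows, arr],
                          index=df.index, columns=labels)
    return names, values

# First two rows of the question's data.
df = pd.DataFrame({'X': [21, 2], 'Y': [101, 220],
                   'Z': [20, 128], 'T': [2, 32]}, index=list('AB'))
df2 = pd.DataFrame({'X': [0.5, 0.12], 'Y': [0.1, 2.201],
                    'Z': [0.20, 0.128], 'T': [0.52, 0.232]}, index=list('AB'))

df3, df4 = k_smallest(df, df2)
print(df3)
print(df4)
```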
The answers are similar, so here are some timings:
np.random.seed(14)
N = 1000000
df1 = pd.DataFrame(np.random.randint(100, size=(N, 4)), columns=['X','Y','Z','T'])
#print (df1)
df1 = pd.DataFrame(np.random.rand(N, 4), columns=['X','Y','Z','T'])
#print (df1)
def jez():
    arr = df.values.argsort(1)[:,:3]
    b = df2.values[np.arange(len(arr))[:,None], arr]
    df3 = pd.DataFrame(df.columns.values[arr])
    df3.columns = ['Col{}'.format(x+1) for x in df3.columns]
    df4 = pd.DataFrame(b)
    df4.columns = ['Col{}'.format(x+1) for x in df4.columns]
    return df3, df4

def pir():
    v = df.values
    a = v.argpartition(3, 1)[:, :3]
    c = df.columns.values[a]
    df3 = pd.DataFrame(c, df.index)
    d = df2.values[np.arange(len(df))[:, None], a]
    df4 = pd.DataFrame(d, df.index, [1, 2, 3]).add_prefix('Col')
    return df3, df4

def cᴏʟᴅsᴘᴇᴇᴅ():
    #another solution is wrong
    #note: DataFrame.lookup was removed in pandas 2.0, so this needs an older pandas
    df3 = df.apply(lambda x: df.columns[np.argsort(x)], 1).iloc[:, :3]
    df4 = pd.DataFrame({'Col{}'.format(i + 1) : df2.lookup(df3.index, df3.iloc[:, i]) for i in range(df3.shape[1])}, index=df.index)
    return df3, df4
print (jez())
print (pir())
print (cᴏʟᴅsᴘᴇᴇᴅ())
In [176]: %timeit (jez())
1000 loops, best of 3: 412 µs per loop
In [177]: %timeit (pir())
1000 loops, best of 3: 425 µs per loop
In [178]: %timeit (cᴏʟᴅsᴘᴇᴇᴅ())
100 loops, best of 3: 3.99 ms per loop
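Note that the three functions above read df and df2, so the figures shown are for the 10-row frames rather than the large df1. A rough sketch of how the two NumPy approaches could be timed on larger frames outside IPython is given below. The names big, big2, by_argsort and by_argpartition are hypothetical, and N is kept smaller than in the snippet above so the script finishes quickly. No timings are quoted here because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(14)
N = 100_000  # assumed size for illustration
cols = ['X', 'Y', 'Z', 'T']
big = pd.DataFrame(rng.random((N, 4)), columns=cols)
big2 = pd.DataFrame(rng.random((N, 4)), columns=cols)

def by_argsort(a, b, k=3):
    # Full sort of each row, then keep the first k positions.
    idx = a.to_numpy().argsort(axis=1)[:, :k]
    rows = np.arange(len(a))[:, None]
    return a.columns.to_numpy()[idx], b.to_numpy()[rows, idx]

def by_argpartition(a, b, k=3):
    # Partition only: the first k positions hold the k smallest, unordered.
    idx = a.to_numpy().argpartition(k - 1, axis=1)[:, :k]
    rows = np.arange(len(a))[:, None]
    return a.columns.to_numpy()[idx], b.to_numpy()[rows, idx]

# Best of three repeats, five calls each.
t_sort = min(timeit.repeat(lambda: by_argsort(big, big2), number=5, repeat=3))
t_part = min(timeit.repeat(lambda: by_argpartition(big, big2), number=5, repeat=3))
print('argsort:      {:.4f} s'.format(t_sort))
print('argpartition: {:.4f} s'.format(t_part))
```

With only four columns per row, the difference between a full sort and a partition may be small; the gap would be expected to matter more for wide frames.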
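I'd use numpy.argpartition, as it only partitions each row into the bottom k and the rest. Its time complexity is O(n) rather than O(n log n), because it does not need to sort completely. The trade-off is that the k smallest entries are not guaranteed to come back in ascending order.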
v = df.values
m = v.shape[1]
a = v.argpartition(3, 1)[:, :3]
c = df.columns.values[a]
We can define df3 based on this.
df3 = pd.DataFrame(c, df.index)
df3
0 1 2
A T Z X
B X T Z
C T X Z
D Y X Z
E Y X T
F Y Z X
G X T Z
H X T Z
I X Z T
J Z T Y
You can use the same indices to create df4:
d = df2.values[np.arange(len(df))[:, None], a]
df4 = pd.DataFrame(d, df.index, [1, 2, 3]).add_prefix('Col')
df4
Col1 Col2 Col3
A 0.520 0.200 0.5000
B 0.120 0.232 0.1280
C 0.340 0.430 0.1360
D 0.140 0.424 0.2144
E 0.525 0.650 0.6256
F 0.310 0.610 0.8670
G 0.170 0.527 0.8200
H 0.938 0.380 0.3630
I 0.229 0.542 0.4229
J 0.512 0.730 0.6500
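In the output above, rows such as D and E list the three columns in a different order than the argsort answer does, because argpartition does not order the entries inside the bottom k. If ascending order is wanted, one option is to sort only those k entries afterwards. The sketch below assumes NumPy 1.15 or later for np.take_along_axis and uses rows A, D and E of the question's data; the names part, order and a_sorted are illustrative.

```python
import numpy as np
import pandas as pd

# Rows A, D and E from the question.
df = pd.DataFrame({'X': [21, 44, 56], 'Y': [101, 140, 250],
                   'Z': [20, 144, 312], 'T': [2, 424, 256]},
                  index=list('ADE'))

k = 3
v = df.to_numpy()

# Step 1: positions of the k smallest values per row, in arbitrary order.
part = np.argpartition(v, k - 1, axis=1)[:, :k]

# Step 2: sort just those k values and reorder the positions to match.
order = np.take_along_axis(v, part, axis=1).argsort(axis=1)
a_sorted = np.take_along_axis(part, order, axis=1)

names = df.columns.to_numpy()[a_sorted]
df3 = pd.DataFrame(names, df.index, [1, 2, 3]).add_prefix('Col')
print(df3)
```

Sorting k entries per row costs O(k log k), so the overall work stays close to linear in the number of columns when k is small.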