Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

merging pandas dataframes with respect to a function output

Is there a convenient way to merge two dataframes with respect to the distance between rows? For the following example, I want to get the color for df1 rows from the closest df2 rows. The distance should be computed as ((x1-x2)**0.5+(y1-y2)**0.5)**0.5.

import pandas as pd

df1 = pd.DataFrame({'x': [50,16,72,61,95,47],'y': [14,22,11,45,58,56],'size':[1,4,3,7,6,5]})
df2 = pd.DataFrame({'x': [10,21,64,31,25,55],'y': [54,76,68,24,34,19],'color':['red','green','blue','white','brown','black']})
like image 267
JBrons Avatar asked Aug 29 '20 12:08

JBrons


People also ask

When using the merge () function on two DataFrames Which of the following joins is likely to preserve the most keys in the result?

Inner joins The most common type of join is called an inner join. An inner join combines two DataFrames based on a join key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.

How do I merge two DataFrames in pandas?

The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.

Which function is used to merge DataFrames?

Pandas DataFrame merge() function is used to merge two DataFrame objects with a database-style join operation. The joining is performed on columns or indexes. If the joining is done on columns, indexes are ignored. This function returns a new DataFrame and the source DataFrame objects are unchanged.

How do I merge two DataFrames in pandas based on common column?

To merge two Pandas DataFrame with common column, use the merge() function and set the ON parameter as the column name.


2 Answers

# function to compare one row of df1 with every row of df2
# note the use of abs() here, square root of negative numbers would be complex number, 
# so the result of the computation would be NaN. abs() helps to avoids that
def compare(x, y):
    df2['distance'] = (abs(x-df2['x'])**0.5 + abs(y-df2['y'])**0.5)**0.5
    return df2.loc[df2['distance'].idxmin()]['color']

df1['color'] = df1.apply(lambda row: compare(row['x'], row['y']), axis=1)
print(df1)

    x   y  size  color
0  50  14     1  black
1  16  22     4  white
2  72  11     3  black
3  61  45     7   blue
4  95  58     6   blue
5  47  56     5    red
like image 121
Rajesh Avatar answered Oct 20 '22 14:10

Rajesh


Something from numpy broadcast

df1['color']=df2.color.iloc[np.argmin(np.sum(np.abs(df1[['x','y']].values-df2[['x','y']].values[:,None])**0.5,2),0)].values
df1
Out[79]: 
    x   y  size  color
0  50  14     1  black
1  16  22     4  white
2  72  11     3  black
3  61  45     7   blue
4  95  58     6   blue
5  47  56     5    red
like image 6
BENY Avatar answered Oct 20 '22 16:10

BENY