I have two dataframes, say A
and B
, that have some columns named attr1
, attr2
, attrN
.
I have a certain distance measure, and I would like to merge the dataframes, such that each row in A
is merged with the row in B
that has the shortest distance between attributes. Note that rows in B
can be repeated when merging.
For example (with one attribute to keep things simple), merging these two tables using absolute difference distance |A.attr1 - B.att1|
A | attr1 B | attr1
0 | 10 0 | 15
1 | 20 1 | 27
2 | 30 2 | 80
should yield the following merged table
M | attr1_A attr1_B
0 | 10 15
1 | 20 15
2 | 30 27
My current way of doing this is slow and is based on comparing each row of A
with each row of B
, but code is also not clear because I have to preserve indices for merging and I am not satisfied at all, but I cannot come up with a better solution.
How can I perform the merge as above using pandas? Are there any convenience methods or functions that can be helpful here?
EDIT: Just to clarify, in the dataframes there are also other columns which are not used in the distance calculation, but have to be merged as well.
One way you could do it as follows:
A = pd.DataFrame({'attr1':[10,20,30]})
B = pd.DataFrame({'attr1':[15,15,27]})
Use a cross join to get all combinations
Update for 1.2+ pandas use how='cross'
merge_AB = A.merge(B, how='cross', suffixes = ('_A', '_B'))
Older pandas version use psuedo key...
A = A.assign(key=1)
B = B.assign(key=1)
merged_AB =pd.merge(A,B, on='key',suffixes=('_A','_B'))
Now let's find the min distances in merged_AB
M = merged_AB.groupby('attr1_A').apply(lambda x:abs(x['attr1_A']-x['attr1_B'])==abs(x['attr1_A']-x['attr1_B']).min())
merged_AB[M.values].drop_duplicates().drop('key',axis=1)
Output:
attr1_A attr1_B
0 10 15
3 20 15
8 30 27
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With