I have 2 dataframes that I would like to merge on a common column. However the column I would like to merge on are not of the same string, but rather a string from one is contained in the other as so:
import pandas as pd
df1 = pd.DataFrame({'column_a':['John','Michael','Dan','George', 'Adam'], 'column_common':['code','other','ome','no match','word']})
df2 = pd.DataFrame({'column_b':['Smith','Cohen','Moore','K', 'Faber'], 'column_common':['some string','other string','some code','this code','word']})
The outcome I would like from d1.merge(d2, ...)
is the following:
column_a | column_b
----------------------
John | Moore <- merged on 'code' contained in 'some code'
Michael | Cohen <- merged on 'other' contained in 'other string'
Dan | Smith <- merged on 'ome' contained in 'some string'
George | n/a
Adam | Faber <- merged on 'word' contained in 'word'
Pandas DataFrame merge() Method The merge() method updates the content of two DataFrame by merging them together, using the specified method(s). Use the parameters to control which values to keep and which to replace.
The concat() function in pandas is used to append either columns or rows from one DataFrame to another. The concat() function does all the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
Merging Dataframes by index of both the dataframes As both the dataframe contains similar IDs on the index. So, to merge the dataframe on indices pass the left_index & right_index arguments as True i.e. Both the dataframes are merged on index using default Inner Join.
Concat function concatenates dataframes along rows or columns. We can think of it as stacking up multiple dataframes. Merge combines dataframes based on values in shared columns. Merge function offers more flexibility compared to concat function because it allows combinations based on a condition.
Here is one approach based on pandas/numpy.
rhs = (df1.column_common
.apply(lambda x: df2[df2.column_common.str.find(x).ge(0)]['column_b'])
.bfill(axis=1)
.iloc[:, 0])
(pd.concat([df1.column_a, rhs], axis=1, ignore_index=True)
.rename(columns={0: 'column_a', 1: 'column_b'}))
column_a column_b
0 John Moore
1 Michael Cohen
2 Dan Smith
3 George NaN
4 Adam Faber
Here's a solution for left-join behaviour, as in it doesn't keep column_a
values that do not match any column_b
values. This is slower than the above numpy/pandas solution because it uses two nested iterrows
loops to build a python list.
tups = [(a1, a2) for i, (a1, b1) in df1.iterrows()
for j, (a2, b2) in df2.iterrows()
if b1 in b2]
(pd.DataFrame(tups, columns=['column_a', 'column_b'])
.drop_duplicates('column_a')
.reset_index(drop=True))
column_a column_b
0 John Moore
1 Michael Cohen
2 Dan Smith
3 Adam Faber
My solution involves applying a function to the common column. I can't imagine it holds up well when df2 is large but perhaps someone more knowledgeable than I can suggest an improvement.
def strmerge(strcolumn):
for i in df2['column_common']:
if strcolumn in i:
return df2[df2['column_common'] == i]['column_b'].values[0]
df1['column_b'] = df1['column_common'].apply(strmerge)
df1
column_a column_common column_b
0 John code Moore
1 Michael other Cohen
2 Dan ome Smith
3 George no match None
4 Adam word Faber
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With