How to merge pandas on string contains?

I have two DataFrames that I would like to merge on a common column. However, the values in that column are not identical strings; rather, the string from one DataFrame is contained in the string from the other, like so:

import pandas as pd
df1 = pd.DataFrame({'column_a':['John','Michael','Dan','George', 'Adam'], 'column_common':['code','other','ome','no match','word']})

df2 = pd.DataFrame({'column_b':['Smith','Cohen','Moore','K', 'Faber'], 'column_common':['some string','other string','some code','this code','word']})

The outcome I would like from df1.merge(df2, ...) is the following:

column_a  |  column_b
John      |  Moore    <- merged on 'code' contained in 'some code' 
Michael   |  Cohen    <- merged on 'other' contained in 'other string'  
Dan       |  Smith    <- merged on 'ome' contained in 'some string'  
George    |  n/a
Adam      |  Faber    <- merged on 'word' contained in 'word'  
callmeGuy asked Feb 18 '19 22:02


2 Answers

New Answer

Here is one approach based on pandas/numpy.

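# For each df1 key, collect column_b from every df2 row whose key contains it.
rhs = (df1.column_common
          .apply(lambda x: df2[df2.column_common.str.find(x).ge(0)]['column_b'])
          .sort_index(axis=1)  # order the candidate columns by df2 row position
          .bfill(axis=1)       # pull each row's first match into the first column
          .iloc[:, 0])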

(pd.concat([df1.column_a, rhs], axis=1, ignore_index=True)
 .rename(columns={0: 'column_a', 1: 'column_b'}))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3   George      NaN
4     Adam    Faber

Old Answer

Here's a solution with inner-join behaviour: it doesn't keep column_a values that do not match any column_b values, and it emits one row for every matching pair. This is slower than the above numpy/pandas solution because it uses two nested iterrows loops to build a Python list.

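# Compare every df1 row with every df2 row and keep the pairs where
# the df1 key (b1) is a substring of the df2 key (b2).
tups = [(a1, a2) for i, (a1, b1) in df1.iterrows()
                 for j, (a2, b2) in df2.iterrows()
        if b1 in b2]

pd.DataFrame(tups, columns=['column_a', 'column_b'])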

  column_a column_b
0     John    Moore
1     John        K
2  Michael    Cohen
3      Dan    Smith
4      Dan    Moore
5     Adam    Faber
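
The nested loops above can also be written as a cross join followed by a filter. What follows is a minimal sketch rather than part of the original answer: it assumes pandas 1.2 or later (for how='cross'), and the names pairs, is_match, matches and result are illustrative. The last step additionally assumes column_a is unique and keeps only the first match per row, which recovers the left-join layout asked for in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'column_a': ['John', 'Michael', 'Dan', 'George', 'Adam'],
                    'column_common': ['code', 'other', 'ome', 'no match', 'word']})
df2 = pd.DataFrame({'column_b': ['Smith', 'Cohen', 'Moore', 'K', 'Faber'],
                    'column_common': ['some string', 'other string', 'some code',
                                      'this code', 'word']})

# Pair every df1 row with every df2 row (assumes pandas >= 1.2).
# The shared column name gets the suffixes _1 (df1) and _2 (df2).
pairs = df1.merge(df2, how='cross', suffixes=('_1', '_2'))

# Keep only the pairs where the df1 key is a substring of the df2 key.
is_match = [a in b for a, b in zip(pairs['column_common_1'],
                                   pairs['column_common_2'])]
matches = pairs.loc[is_match, ['column_a', 'column_b']].reset_index(drop=True)

# Optional: keep the first match per column_a (assumed unique) and
# left-merge back onto df1 so that unmatched rows survive as NaN.
result = df1[['column_a']].merge(matches.drop_duplicates('column_a'),
                                 on='column_a', how='left')
print(result)
```

The cross join builds len(df1) * len(df2) rows in memory, so this trades the Python-level loops for a larger intermediate frame; it is likely only practical when that product stays modest.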
Peter Leimbigler answered Oct 02 '22 15:10


My solution involves applying a function to the common column. I can't imagine it holds up well when df2 is large but perhaps someone more knowledgeable than I can suggest an improvement.

def strmerge(strcolumn):
    for i in df2['column_common']:
        if strcolumn in i:
            return df2[df2['column_common'] == i]['column_b'].values[0]

df1['column_b'] = df1['column_common'].apply(strmerge)

    column_a    column_common   column_b
0   John        code            Moore
1   Michael     other           Cohen
2   Dan         ome             Smith
3   George      no match        None
4   Adam        word            Faber
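
One possible refinement of the function above, offered as a sketch and not as the answerer's own code: strmerge re-filters df2 after it has already found the matching string, which is an extra pass over df2 per lookup. Pairing the keys and values up front avoids that second pass. The names candidates and first_match are hypothetical, and the approach is still a linear scan of df2 for every df1 row, so it is not expected to scale to a very large df2 either.

```python
import pandas as pd

df1 = pd.DataFrame({'column_a': ['John', 'Michael', 'Dan', 'George', 'Adam'],
                    'column_common': ['code', 'other', 'ome', 'no match', 'word']})
df2 = pd.DataFrame({'column_b': ['Smith', 'Cohen', 'Moore', 'K', 'Faber'],
                    'column_common': ['some string', 'other string', 'some code',
                                      'this code', 'word']})

# Materialise df2 once as (key, value) pairs so each lookup is a plain Python scan.
candidates = list(zip(df2['column_common'], df2['column_b']))

def first_match(needle):
    """Return column_b of the first df2 row whose key contains needle, else None."""
    return next((value for key, value in candidates if needle in key), None)

df1['column_b'] = df1['column_common'].map(first_match)
print(df1)
```

As with strmerge, only the first matching df2 row is used, so 'code' resolves to 'some code' (Moore) and 'this code' (K) is ignored.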
Chris Decker answered Oct 02 '22 15:10
