How to merge pandas on string contains?

I have two dataframes that I would like to merge on a common column. However, the values in that column are not identical strings: a string from one dataframe is contained in a string from the other, like so:

import pandas as pd
df1 = pd.DataFrame({'column_a':['John','Michael','Dan','George', 'Adam'], 'column_common':['code','other','ome','no match','word']})

df2 = pd.DataFrame({'column_b':['Smith','Cohen','Moore','K', 'Faber'], 'column_common':['some string','other string','some code','this code','word']})

The outcome I would like from df1.merge(df2, ...) is the following:

column_a  |  column_b
----------------------
John      |  Moore    <- merged on 'code' contained in 'some code' 
Michael   |  Cohen    <- merged on 'other' contained in 'other string'  
Dan       |  Smith    <- merged on 'ome' contained in 'some string'  
George    |  n/a
Adam      |  Faber    <- merged on 'word' contained in 'word'

asked Feb 18 '19 22:02

callmeGuy

2 Answers

New Answer

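Here is one approach based on pandas/numpy. For each value in df1.column_common it selects the df2 rows whose column_common contains that value, then keeps the first match (NaN if there is none).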

rhs = (df1.column_common
          .apply(lambda x: df2[df2.column_common.str.find(x).ge(0)]['column_b'])
          .bfill(axis=1)
          .iloc[:, 0])

(pd.concat([df1.column_a, rhs], axis=1, ignore_index=True)
 .rename(columns={0: 'column_a', 1: 'column_b'}))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3   George      NaN
4     Adam    Faber

Old Answer

Here's a solution with inner-join behaviour: it does not keep column_a values that match no column_b value. It is slower than the pandas/numpy solution above because it uses two nested iterrows loops to build a Python list.

tups = [(a1, a2) for i, (a1, b1) in df1.iterrows() 
                 for j, (a2, b2) in df2.iterrows()
        if b1 in b2]

(pd.DataFrame(tups, columns=['column_a', 'column_b'])
   .drop_duplicates('column_a')
   .reset_index(drop=True))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3     Adam    Faber
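
The nested loops above amount to a cross join followed by a substring filter. A minimal sketch of that idea, assuming pandas 1.2 or later (for how='cross') and unique values in column_a; the names pairs, mask, matches and result are only illustrative. It builds len(df1) * len(df2) rows, so it is unlikely to suit large inputs.

```python
import pandas as pd

df1 = pd.DataFrame({'column_a': ['John', 'Michael', 'Dan', 'George', 'Adam'],
                    'column_common': ['code', 'other', 'ome', 'no match', 'word']})
df2 = pd.DataFrame({'column_b': ['Smith', 'Cohen', 'Moore', 'K', 'Faber'],
                    'column_common': ['some string', 'other string', 'some code',
                                      'this code', 'word']})

# Pair every df1 row with every df2 row (assumes pandas >= 1.2).
pairs = df1.merge(df2, how='cross', suffixes=('_1', '_2'))

# Keep only the pairs where df1's string occurs inside df2's string.
mask = [needle in haystack
        for needle, haystack in zip(pairs['column_common_1'],
                                    pairs['column_common_2'])]

# Keep the first match per column_a (assumes column_a values are unique).
matches = pairs.loc[mask].drop_duplicates('column_a')

# Left-merge back onto df1 so unmatched rows (George) stay, with NaN.
result = df1[['column_a']].merge(matches[['column_a', 'column_b']],
                                 on='column_a', how='left')
print(result)
```

Dropping the final left merge and using matches directly gives the inner-join result shown above.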

answered Oct 02 '22 15:10

Peter Leimbigler

My solution applies a function to the common column. I can't imagine it holds up well when df2 is large, but perhaps someone more knowledgeable than I can suggest an improvement.

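def strmerge(strcolumn):
    # Return column_b for the first df2 row whose column_common contains strcolumn.
    for i in df2['column_common']:
        if strcolumn in i:
            return df2[df2['column_common'] == i]['column_b'].values[0]
    return None  # no match found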

df1['column_b'] = df1['column_common'].apply(strmerge)

df1
    column_a    column_common   column_b
0   John        code            Moore
1   Michael     other           Cohen
2   Dan         ome             Smith
3   George      no match        None
4   Adam        word            Faber
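
One possible variation on the same idea replaces the inner Python loop with Series.str.contains and passes df2's columns in explicitly rather than reading a global. This is only a sketch: first_match and out are hypothetical names, regex=False is used so the search string is treated literally, and na=False is an assumption about how missing values in df2 should be handled. It still scans df2 once per df1 row, so the scaling concern remains.

```python
import pandas as pd

df1 = pd.DataFrame({'column_a': ['John', 'Michael', 'Dan', 'George', 'Adam'],
                    'column_common': ['code', 'other', 'ome', 'no match', 'word']})
df2 = pd.DataFrame({'column_b': ['Smith', 'Cohen', 'Moore', 'K', 'Faber'],
                    'column_common': ['some string', 'other string', 'some code',
                                      'this code', 'word']})

def first_match(needle, haystacks, values):
    """Return the value paired with the first haystack containing needle, else None."""
    # Literal substring test; missing haystacks count as non-matches.
    hits = values[haystacks.str.contains(needle, regex=False, na=False)]
    return hits.iloc[0] if len(hits) else None

# assign() returns a new frame and leaves df1 unchanged.
out = df1.assign(column_b=df1['column_common'].apply(
    lambda s: first_match(s, df2['column_common'], df2['column_b'])))
print(out)
```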

answered Oct 02 '22 15:10

Chris Decker

Tags:

python

merge

pandas