How to merge two pandas DataFrames based on a similarity function?

Question

Given dataset 1

name,x,y
st. peter,1,2
big university portland,3,4

and dataset 2

name,x,y
saint peter,3,4
uni portland,5,6

The goal is to merge on

d1.merge(d2, on="name", how="left")

There are no exact matches on name though, so I'm looking to do a kind of fuzzy matching. The matching technique itself matters less here than how to incorporate it efficiently into pandas.

For example, st. peter might match saint peter in the other dataset, but big university portland might deviate too much for us to match it with uni portland.

One way to think of it is to allow joining with the lowest Levenshtein distance, but only if it is below 5 edits (st. --> saint is 4).
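To make that threshold concrete, here is a minimal pure-Python sketch of the Levenshtein distance applied to the two example pairs. The helper name `levenshtein` and the constant `MAX_EDITS` are illustrative only and do not come from pandas or any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


MAX_EDITS = 5  # threshold from the question: match only below 5 edits

for left, right in [("st. peter", "saint peter"),
                    ("big university portland", "uni portland")]:
    d = levenshtein(left, right)
    verdict = "match" if d < MAX_EDITS else "no match"
    print(f"{left!r} vs {right!r}: {d} edits -> {verdict}")
```

With this definition the first pair is 4 edits apart and the second is 11, so only st. peter would pass the threshold.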

The resulting dataframe should only contain the row st. peter, and contain both "name" variations, and both x and y variables.

Is there a way to do this kind of merging using pandas?

majr · Accepted Answer

Did you look at fuzzywuzzy?

You might do something like:

import pandas as pd
import fuzzywuzzy.process as fwp

choices = list(df2['name'])

def fmatch(row):
    minscore = 95  # or whatever score works for you
    # use row['name']: inside apply(axis=1), row.name is the row's index label, not the 'name' column
    choice, score = fwp.extractOne(row['name'], choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch, axis=1)
merged = pd.merge(df1,
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=('_df1', '_df2'),
                  how='outer')  # assuming you want to keep unmatched records

Caveat Emptor: I haven't tried to run this.
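If installing fuzzywuzzy is not an option, the same flow can probably be sketched with the standard library's `difflib`. Treat this as an illustration under a few assumptions rather than a drop-in equivalent: it relies on `difflib.get_close_matches`, whose cutoff is a similarity ratio between 0 and 1 (not a fuzzywuzzy score or an edit count), the 0.75 cutoff was chosen only to suit the sample data from the question, and `closest_name` is a hypothetical helper. It uses an inner join so that only matched rows survive, as the question asks.

```python
import difflib

import pandas as pd

# Sample frames mirroring the data in the question.
d1 = pd.DataFrame({"name": ["st. peter", "big university portland"],
                   "x": [1, 3], "y": [2, 4]})
d2 = pd.DataFrame({"name": ["saint peter", "uni portland"],
                   "x": [3, 5], "y": [4, 6]})

choices = d2["name"].tolist()


def closest_name(name, cutoff=0.75):
    """Return the most similar entry in `choices`, or None if none reaches `cutoff`."""
    hits = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None


# Map each left-hand name to its best right-hand candidate (or None).
left = d1.assign(match=d1["name"].map(closest_name))

# Drop unmatched rows, then join on the matched name; the default inner join
# keeps only rows that found a partner in d2.
merged = (
    left.dropna(subset=["match"])
        .merge(d2, left_on="match", right_on="name", suffixes=("_d1", "_d2"))
        .drop(columns="match")
)
print(merged)
```

Note that `get_close_matches` compares every left name against every candidate, so for large frames it may be worth blocking on a cheap key (first letter, length bucket) before matching.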

Tags: python, merge, pandas, fuzzy-comparison

PascalVKooten
