Is it possible to do a fuzzy match merge with Python pandas?

Tags: python, pandas

I have two DataFrames which I want to merge based on a column. However, because of alternate spellings, differing numbers of spaces, and the absence or presence of diacritical marks, I would like to be able to merge rows as long as their keys are similar to one another.

Any similarity algorithm will do (Soundex, Levenshtein, or difflib's).

Say one DataFrame has the following data:

from pandas import DataFrame

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

       number
one         1
two         2
three       3
four        4
five        5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

      letter
one        a
too        b
three      c
fours      d
five       e

Then I want to get the following resulting DataFrame:

       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e


asked Nov 29 '12 23:11

pocketfullofcheese

2 Answers

Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then do a join:

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]: 
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]: 
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

If the names were in a column rather than the index, you could, in the same vein, apply the mapping to that column and then merge:

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
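
One caveat with both variants: get_close_matches returns an empty list when no candidate reaches its cutoff (0.6 by default), so the [0] raises an IndexError for a label with no close match. A minimal sketch of a guard follows; closest_or_none is a hypothetical helper name, not part of difflib or pandas:

```python
import difflib

def closest_or_none(name, candidates, cutoff=0.6):
    """Return the best difflib match for name, or None if nothing reaches cutoff."""
    # n=1 asks for only the single best candidate; cutoff is difflib's 0..1 ratio.
    matches = difflib.get_close_matches(name, list(candidates), n=1, cutoff=cutoff)
    return matches[0] if matches else None

names = ['one', 'two', 'three', 'four', 'five']
# 'zzz' shares no characters with any candidate, so it maps to None.
print([closest_or_none(x, names) for x in ['one', 'too', 'fours', 'zzz']])
```

With the column example above, this could be applied as df2['name'].map(lambda x: closest_or_none(x, df1['name'])). Rows mapped to None should then simply fail to match in the default inner merge, or they can be inspected and dropped beforehand.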

answered Oct 24 '22 12:10

Andy Hayden

Using `fuzzywuzzy`

Since there are no examples with the fuzzywuzzy package, here's a function I wrote which will return all matches based on a threshold you can set as a user:

Example dataframes

import pandas as pd

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

# df1
          Key
0       Apple
1      Banana
2      Orange
3  Strawberry

# df2
        Key
0      Aple
1     Mango
2      Orag
3     Straw
4  Bannanna
5     Berry

Function for fuzzy matching

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum score (0-100) a candidate needs to count as a match; scores are based on Levenshtein distance
    :param limit: the maximum number of matches returned per key, sorted from highest to lowest score
    :return: df_1 with an added 'matches' column holding the matched keys
    """
    s = df_2[key2].tolist()
    
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m
    
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    
    return df_1

Using our function on the dataframes: #1

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)

          Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry

Using our function on the dataframes: #2

df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})

fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)

        Col1  matches
0  Microsoft  Mcrsoft
1     Google    gogle
2     Amazon   Amason
3        IBM
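
fuzzy_merge returns the matches as one comma-separated string per row, which is easy to read but not directly usable as a join key. If one row per matched pair is needed, the same threshold-and-limit idea could be sketched as below. This is only an illustration: it swaps in difflib.SequenceMatcher from the standard library as the scorer, so its scores will differ from fuzzywuzzy's, and score and match_pairs are hypothetical names.

```python
from difflib import SequenceMatcher

def score(a, b):
    """Similarity on a 0-100 scale (a stand-in scorer, not fuzzywuzzy's)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def match_pairs(left, right, threshold=80, limit=2):
    """Return (left_key, right_key, score) for the top `limit` candidates at or above threshold."""
    pairs = []
    for l in left:
        # Rank every right-hand key by score, best first, and keep the top `limit`.
        ranked = sorted(((score(l, r), r) for r in right), reverse=True)[:limit]
        # Keep only the candidates that clear the threshold.
        pairs.extend((l, r, s) for s, r in ranked if s >= threshold)
    return pairs

pairs = match_pairs(['Apple', 'Banana'], ['Aple', 'Mango', 'Bannanna'], threshold=80)
for left_key, right_key, s in pairs:
    print(left_key, right_key, s)
```

The list of tuples could then be turned into a lookup table with pd.DataFrame(pairs, columns=[...]) and joined to both frames with ordinary exact merges.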

Installation:

Pip

pip install fuzzywuzzy

Anaconda

conda install -c conda-forge fuzzywuzzy

answered Oct 24 '22 12:10

Erfan
