
Fastest way to fuzzy match strings between two pandas data frames


I have two data frames, each with a column of names:

df1['name']   -> 3,000 rows

df2['name']   -> 64,000 rows

I am using fuzzywuzzy to get the best match in df2 for each df1 entry, using the following code:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# For each name in df1, find its single best match among the names in df2.
matches = [process.extract(x, df2['name'], limit=1) for x in df1['name']]

But this is taking forever to finish. Is there any faster way to do the fuzzy matching of strings in pandas?

kunal deep asked Aug 16 '17 03:08


1 Answer

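One improvement I can see in your code is to use a generator expression: replace the square brackets with round brackets. The statement then returns immediately and the results are never all held in memory at once. Note that the matching itself is only deferred until you iterate over the generator, so the total time spent scoring is unchanged.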

matches = (process.extract(x, df2['name'], limit=1) for x in df1['name'])

Edit: One more suggestion: the operation can be parallelized with the multiprocessing library, since each name is matched independently of the others.
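A minimal sketch of that idea follows. It is an illustration under assumptions, not a tuned implementation: the helper names `best_match` and `match_all` are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for the fuzzywuzzy scorer so the example runs without extra packages. In real use you would call `process.extractOne` inside the worker instead.

```python
from difflib import SequenceMatcher
from functools import partial
from multiprocessing import Pool


def best_match(query, choices):
    """Return (best_choice, score 0-100) for one query string.

    Stand-in scorer: difflib's ratio, not fuzzywuzzy's (assumption).
    """
    scored = ((c, SequenceMatcher(None, query, c).ratio()) for c in choices)
    choice, ratio = max(scored, key=lambda pair: pair[1])
    return choice, round(100 * ratio)


def match_all(queries, choices, workers=4, chunksize=100):
    """Match every query against choices, spreading queries over processes."""
    worker = partial(best_match, choices=choices)
    with Pool(processes=workers) as pool:
        # chunksize batches queries so each task carries many lookups.
        return pool.map(worker, queries, chunksize=chunksize)


if __name__ == "__main__":
    # Tiny hypothetical data; in the question these would be
    # df1['name'].tolist() and df2['name'].tolist().
    queries = ["apple inc", "microsft"]
    choices = ["Apple Inc.", "Microsoft", "Alphabet"]
    for query, match in zip(queries, match_all(queries, choices, workers=2)):
        print(query, "->", match)
```

The work is split over the 3,000 query names, so each process scores its share of queries against the full list of choices. The speedup is bounded by the number of CPU cores, and the choices list is sent to the workers with each chunk, so a larger `chunksize` keeps that overhead down.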

StatguyUser answered Sep 25 '22 11:09