Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Fuzzy matching strings in list performance

I'm checking if there are similar results (fuzzy match) in 4 same dataframe columns, and I have the following code, as an example. When I apply it to the real 40.000 rows x 4 columns dataset, keeps running in eternum. The issue is that the code is too slow. For example, if I limite the dataset to 10 users, it takes 8 minutes to compute, while for 20, 19 minutes. Is there anything I am missing? I do not know why this take that long. I expect to have all results, maximum in 2 hours or less. Any hint or help would be greatly appreciated.

from fuzzywuzzy import process
dataframecolumn = ["apple","tb"]
compare = ["adfad","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
    for match in ratio:
        if match[1] != 100:
            result.append(match)
            break
print (result) 

Output: [('asple', 80), ('tab', 80)]

like image 649
ecp Avatar asked May 08 '19 12:05

ecp


People also ask

Is fuzzy matching good?

Fuzzy string matching can help improve data quality and accuracy by data deduplication, identification of false-positives etc.

Is FuzzyWuzzy slow?

FuzzyWuzzy package is a Levenshtein distance based method which widely used in computing similarity scores of strings. But why we should not use it? The answer is simple: it is way too slow. The estimated time of computing similarity scores for a 406,000-entity dataset of addresses is 337 hours.

How long is fuzzy matching?

From 3.7 hours to 0.2 seconds. How to perform intelligent string matching in a way that can scale to even the biggest data sets. Same but different. Fuzzy matching of data is an essential first-step for a huge range of data science workflows.

What is Fuzzy Wuzzy in python?

Fuzzywuzzy is a python library that uses Levenshtein Distance to calculate the differences between sequences and patterns that was developed and also open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcase them on one platform.


1 Answers

Major speed improvements come by writing vectorized operations and avoiding loops

Importing necessary package

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

Creating dataframe from first list

dataframecolumn = pd.DataFrame(["apple","tb"])
dataframecolumn.columns = ['Match']

Creating dataframe from second list

compare = pd.DataFrame(["adfad","apple","asple","tab"])
compare.columns = ['compare']

Merge - Cartesian product by introducing key(self join)

dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare,on="Key",how="left")
combined_dataframe = combined_dataframe[~(combined_dataframe.Match==combined_dataframe.compare)]

Vectorization

def partial_match(x,y):
    return(fuzz.ratio(x,y))
partial_match_vector = np.vectorize(partial_match)

Using vectorization and getting desired result by putting threshold on score

combined_dataframe['score']=partial_match_vector(combined_dataframe['Match'],combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score>=80]

Results

+--------+-----+--------+------+
| Match  | Key | compare | score
+--------+-----+--------+------+
| apple  | 1   |   asple |    80
|  tb    | 1   |   tab   |    80
+--------+-----+--------+------+
like image 85
Atendra Avatar answered Oct 05 '22 23:10

Atendra