I'm checking for similar results (fuzzy matches) across 4 identical dataframe columns, using the code below as an example. When I apply it to the real 40,000 rows x 4 columns dataset, it runs seemingly forever. The issue is that the code is too slow: if I limit the dataset to 10 users it takes 8 minutes to compute, and for 20 users, 19 minutes. Is there anything I am missing? I don't know why this takes so long. I expect to have all results in 2 hours at most. Any hint or help would be greatly appreciated.
from fuzzywuzzy import process
dataframecolumn = ["apple", "tb"]
compare = ["adfad", "apple", "asple", "tab"]
Ratios = [process.extract(x, compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
    for match in ratio:
        if match[1] != 100:
            result.append(match)
            break
print(result)

Output: [('asple', 80), ('tab', 80)]
Fuzzy string matching can help improve data quality and accuracy through data deduplication, identification of false positives, etc.
The FuzzyWuzzy package is a Levenshtein-distance-based method widely used for computing similarity scores of strings. But why should we not use it? The answer is simple: it is way too slow. The estimated time for computing similarity scores on a 406,000-entity dataset of addresses is 337 hours.
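The slowness is no mystery: scoring every string against every candidate grows quadratically with the data. A back-of-the-envelope count for the 40,000-row case from the question (illustrative arithmetic, not a benchmark):

```python
# All-pairs scoring is quadratic in the number of rows, so runtimes
# blow up fast as the dataset grows.
rows = 40_000
comparisons = rows * rows  # each string scored against every other
print(f"{comparisons:,} pairwise scores")  # 1,600,000,000
```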
From 3.7 hours to 0.2 seconds: it is possible to perform intelligent string matching in a way that scales to even the biggest datasets. Same, but different: fuzzy matching of data is an essential first step for a huge range of data science workflows.
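One common way such speedups are achieved (the exact pipeline behind the numbers above isn't shown here) is to represent each string as character n-grams and compare count vectors with cosine similarity, which can then be computed in bulk with sparse matrix multiplication. A minimal pure-Python sketch of the n-gram cosine idea, with illustrative function names:

```python
from collections import Counter
from math import sqrt

def ngrams(s, n=3):
    """Split a string into overlapping character n-grams."""
    s = f" {s} "  # pad so short strings still produce grams
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_sim(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = ngrams(a, n), ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

In practice the n-gram vectors would be built with a TF-IDF vectorizer and multiplied as sparse matrices, which is where the large speedups come from.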
Fuzzywuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences and patterns. It was developed and open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcases them on one platform.
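fuzz.ratio itself is Levenshtein-based, but the standard library's difflib computes a closely related 0-1 ratio (2*M/T, where M is the number of matched characters and T the combined length), which is handy for building intuition without installing anything. A small stand-in sketch:

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Scale difflib's 0-1 similarity ratio to the 0-100 range
    that fuzz.ratio uses."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

print(simple_ratio("apple", "asple"))  # 80, matching the score above
```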
Major speed improvements come from writing vectorized operations and avoiding loops:
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

dataframecolumn = pd.DataFrame(["apple", "tb"])
dataframecolumn.columns = ['Match']
compare = pd.DataFrame(["adfad", "apple", "asple", "tab"])
compare.columns = ['compare']

# Cross join: give both frames a constant key and merge on it
# (newer pandas versions also support merge(..., how="cross"))
dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare, on="Key", how="left")

# Drop exact matches so only non-identical pairs are scored
combined_dataframe = combined_dataframe[~(combined_dataframe.Match == combined_dataframe.compare)]

def partial_match(x, y):
    return fuzz.ratio(x, y)

# Vectorize the scoring function and apply it column-wise instead of looping
partial_match_vector = np.vectorize(partial_match)
combined_dataframe['score'] = partial_match_vector(combined_dataframe['Match'], combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score >= 80]
+-------+-----+---------+-------+
| Match | Key | compare | score |
+-------+-----+---------+-------+
| apple | 1   | asple   | 80    |
| tb    | 1   | tab     | 80    |
+-------+-----+---------+-------+
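If you want a single best candidate per string rather than all pairs above the threshold, one option is a groupby with idxmax on the score. A sketch using a small hypothetical frame shaped like the scored cross-join above:

```python
import pandas as pd

# Hypothetical scored cross-join, shaped like combined_dataframe above
combined = pd.DataFrame({
    "Match":   ["apple", "apple", "tb", "tb"],
    "compare": ["asple", "adfad", "tab", "adfad"],
    "score":   [80, 20, 80, 0],
})

# idxmax picks the row index with the highest score within each group,
# so .loc keeps exactly one best-scoring candidate per 'Match' string
best = combined.loc[combined.groupby("Match")["score"].idxmax()]
print(best)
```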