I'm checking for similar results (fuzzy matches) across 4 identical dataframe columns, using the code below as an example. When I apply it to the real 40,000 rows x 4 columns dataset, it runs seemingly forever. The issue is that the code is too slow: if I limit the dataset to 10 users it takes 8 minutes to compute, and for 20 users, 19 minutes. Is there anything I am missing? I don't know why this takes so long. I expect to have all results in 2 hours at most. Any hint or help would be greatly appreciated.
from fuzzywuzzy import process
dataframecolumn = ["apple", "tb"]
compare = ["adfad", "apple", "asple", "tab"]
Ratios = [process.extract(x, compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
    for match in ratio:
        if match[1] != 100:
            result.append(match)
            break
print(result)

Output: [('asple', 80), ('tab', 80)]
Fuzzy string matching can help improve data quality and accuracy through data deduplication, identification of false positives, etc.
The FuzzyWuzzy package is a Levenshtein-distance-based method widely used for computing similarity scores of strings. But why should we not use it? The answer is simple: it is way too slow. The estimated time for computing similarity scores on a 406,000-entity dataset of addresses is 337 hours.
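The slowness is no mystery: scoring every string against every candidate grows quadratically with the data. A back-of-the-envelope count for the 40,000-row case from the question (illustrative arithmetic, not a benchmark):

```python
# All-pairs scoring is quadratic in the number of rows, so runtimes
# blow up fast as the dataset grows.
rows = 40_000
comparisons = rows * rows  # each string scored against every other
print(f"{comparisons:,} pairwise scores")  # 1,600,000,000
```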
From 3.7 hours to 0.2 seconds: it is possible to perform intelligent string matching in a way that scales to even the biggest datasets. Same, but different: fuzzy matching of data is an essential first step for a huge range of data science workflows.
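One common way such speedups are achieved (the exact pipeline behind the numbers above isn't shown here) is to represent each string as character n-grams and compare count vectors with cosine similarity, which can then be computed in bulk with sparse matrix multiplication. A minimal pure-Python sketch of the n-gram cosine idea, with illustrative function names:

```python
from collections import Counter
from math import sqrt

def ngrams(s, n=3):
    """Split a string into overlapping character n-grams."""
    s = f" {s} "  # pad so short strings still produce grams
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_sim(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = ngrams(a, n), ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

In practice the n-gram vectors would be built with a TF-IDF vectorizer and multiplied as sparse matrices, which is where the large speedups come from.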
Fuzzywuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences and patterns. It was developed and open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcases them on one platform.
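fuzz.ratio itself is Levenshtein-based, but the standard library's difflib computes a closely related 0-1 ratio (2*M/T, where M is the number of matched characters and T the combined length), which is handy for building intuition without installing anything. A small stand-in sketch:

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Scale difflib's 0-1 similarity ratio to the 0-100 range
    that fuzz.ratio uses."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

print(simple_ratio("apple", "asple"))  # 80, matching the score above
```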
Major speed improvements come from writing vectorized operations and avoiding loops:
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

dataframecolumn = pd.DataFrame(["apple", "tb"])
dataframecolumn.columns = ['Match']
compare = pd.DataFrame(["adfad", "apple", "asple", "tab"])
compare.columns = ['compare']

# Cross join: give both frames a constant key and merge on it
# (newer pandas versions also support merge(..., how="cross"))
dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare, on="Key", how="left")

# Drop exact matches so only non-identical pairs are scored
combined_dataframe = combined_dataframe[~(combined_dataframe.Match == combined_dataframe.compare)]

def partial_match(x, y):
    return fuzz.ratio(x, y)

# Vectorize the scoring function and apply it column-wise instead of looping
partial_match_vector = np.vectorize(partial_match)
combined_dataframe['score'] = partial_match_vector(combined_dataframe['Match'], combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score >= 80]
+-------+-----+---------+-------+
| Match | Key | compare | score |
+-------+-----+---------+-------+
| apple | 1   | asple   | 80    |
| tb    | 1   | tab     | 80    |
+-------+-----+---------+-------+
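If you want a single best candidate per string rather than all pairs above the threshold, one option is a groupby with idxmax on the score. A sketch using a small hypothetical frame shaped like the scored cross-join above:

```python
import pandas as pd

# Hypothetical scored cross-join, shaped like combined_dataframe above
combined = pd.DataFrame({
    "Match":   ["apple", "apple", "tb", "tb"],
    "compare": ["asple", "adfad", "tab", "adfad"],
    "score":   [80, 20, 80, 0],
})

# idxmax picks the row index with the highest score within each group,
# so .loc keeps exactly one best-scoring candidate per 'Match' string
best = combined.loc[combined.groupby("Match")["score"].idxmax()]
print(best)
```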