I am trying to understand how the process.extract() function in the Python module fuzzywuzzy works.
I mainly read about the fuzzywuzzy package here: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/, which is a great post explaining different scenarios that come up when doing fuzzy matching. It discusses several scenarios for Partial String Similarity:
1) Out Of Order
2) Token Sort
3) Token Set
And then, from this post: https://pathindependence.wordpress.com/2015/10/31/tutorial-fuzzywuzzy-string-matching-in-python-improving-merge-accuracy-across-data-products-and-naming-conventions/ I learned how to use fuzzywuzzy's process.extract() function to basically select the top k matches.
I cannot find much information about how the process.extract() function works. Here is the description I found in the source on their GitHub page (https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py), which says that this function will:
Find best matches in a list or dictionary of choices, return a list of tuples containing the match and it's score. If a dictionary is used, also returns the key for each match.
However, it does not explain HOW it finds the best matches. Does it take all 3 scenarios I mentioned above into account?
The reason I ask is that, when I used this function, there were sometimes two strings that are very similar but were not matched.
For example, in my current sample data set, the string to be matched
"Total replenishment lead time (in workdays)"
it is matched to
"PLANNING_TIME_FENCE_CODE", "BUILD_IN_WIP_FLAG"
but not to (the right answer)
"FULL_LEAD_TIME"
Even though the right answer contains "lead time", just like the string to be matched does, it is not matched at all. Why? And somehow the other ones, which do not look like the string to be matched, do get matched. Why? I am quite clueless now.
FuzzyWuzzy is a Python library for fuzzy string matching, that is, finding strings that approximately match a given pattern. It uses Levenshtein distance to calculate the differences between sequences, and it was developed and open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcases them on one platform.
Just like the Levenshtein package, FuzzyWuzzy has a ratio function that calculates the standard Levenshtein distance similarity ratio between two sequences.
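To make "Levenshtein distance" concrete, here is a minimal sketch of the textbook dynamic-programming version in plain Python. The function name levenshtein is only for illustration, and this is not FuzzyWuzzy's own code; its ratio functions turn a comparison like this into a 0-100 score, and the exact normalization they use is not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A smaller distance means the strings are closer. Note that a distance like this is sensitive to word order and extra words, which is why the token-based scorers exist.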
The other answer is wrong in a key respect: it infers that because the result of process.extract was the same as fuzz.partial_ratio in one case, the two must be doing the same thing by default.
process.extract actually uses WRatio() by default, which is a weighted combination of the four fuzz ratios. This empirically works pretty well across fuzzy matching scenarios.
Still, you can manually specify the string comparison function via the scorer argument to extract.
Source for process.extract: https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py
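To illustrate what a pluggable scorer means, here is a rough standard-library sketch of the idea. It is not fuzzywuzzy's implementation: the names normalize, simple_ratio, sketch_token_sort_ratio and extract_top are hypothetical, and difflib.SequenceMatcher is used as a stand-in similarity measure. The point is only that the ranking comes entirely from whichever scoring function you pass in.

```python
from difflib import SequenceMatcher


def normalize(s):
    # Hypothetical preprocessing: lower-case and turn non-alphanumerics into spaces,
    # so "FULL_LEAD_TIME" becomes "full lead time".
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in s)
    return " ".join(cleaned.split())


def simple_ratio(s1, s2):
    # Stand-in similarity on a 0-100 scale (not fuzzywuzzy's own scorer).
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def sketch_token_sort_ratio(s1, s2):
    # Sort the words first so that word order no longer matters.
    def sort_tokens(s):
        return " ".join(sorted(s.split()))
    return simple_ratio(sort_tokens(s1), sort_tokens(s2))


def extract_top(query, choices, scorer=simple_ratio, limit=5):
    # Score every choice against the query and keep the best `limit` results.
    q = normalize(query)
    scored = [(choice, scorer(q, normalize(choice))) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = ["PLANNING_TIME_FENCE_CODE", "BUILD_IN_WIP_FLAG", "FULL_LEAD_TIME"]
query = "Total replenishment lead time (in workdays)"
print(extract_top(query, choices, scorer=simple_ratio, limit=3))
print(extract_top(query, choices, scorer=sketch_token_sort_ratio, limit=3))
```

With fuzzywuzzy itself, the corresponding call would look like process.extract(query, choices, scorer=fuzz.token_set_ratio, limit=3). Trying a few scorers on your own data is the quickest way to see which one ranks "FULL_LEAD_TIME" where you expect it.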