Python fuzzywuzzy's process.extract(): how does it work?

Tags: python, string, fuzzywuzzy

I am trying to understand how the process.extract() function in the Python module fuzzywuzzy works.

I mainly read about the fuzzywuzzy package here: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/, which is a great post explaining the different scenarios that come up when doing fuzzy matching. It discusses several scenarios for partial string similarity:

1) Out Of Order
2) Token Sort
3) Token Set

Then, from this post: https://pathindependence.wordpress.com/2015/10/31/tutorial-fuzzywuzzy-string-matching-in-python-improving-merge-accuracy-across-data-products-and-naming-conventions/, I learned how to use fuzzywuzzy's process.extract() function to select the top k matches.

I cannot find much information about how process.extract() works. Here is the description I found on their GitHub page (https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py):

Find best matches in a list or dictionary of choices, return a list of tuples containing the match and it's score. If a dictionary is used, also returns the key for each match.

However, it does not explain HOW the function finds the best matches. Does it take all three of the scenarios mentioned above into account?

The reason I ask is that, when I use this function, two strings that look very similar are sometimes not matched.

For example, in my current sample data set, for the to-be-matched string

"Total replenishment lead time (in workdays)"

it is matched to

"PLANNING_TIME_FENCE_CODE", "BUILD_IN_WIP_FLAG"

but not to (the right answer)

"FULL_LEAD_TIME"

Even though the right answer contains "lead time", just like the to-be-matched string, it is not matched at all. Why? And somehow the other ones, which do not look like the to-be-matched string, do get matched. Why? I am quite clueless now.

asked Dec 15 '16 19:12

alwaysaskingquestions

1 Answer

The other answer is wrong in a key respect: it infers that, because the result of process.extract was the same as that of fuzz.partial_ratio in one case, the two do the same thing by default.

process.extract actually uses WRatio() by default, which is a weighted combination of the four fuzz ratios: it computes several of those ratios, scales them by fixed weights, and returns the highest resulting score. Empirically, this works pretty well across fuzzy matching scenarios.
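As a rough mental model, process.extract scores the query against every choice with the chosen scorer and keeps the highest-scoring ones. The sketch below illustrates that loop using only the standard library. It is not fuzzywuzzy's implementation: simple_ratio, preprocess, and extract_sketch are hypothetical names, difflib's SequenceMatcher stands in for WRatio, and the cleaning step is a simplification of the library's own preprocessing.

```python
import heapq
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer: 0-100 similarity from difflib (NOT fuzzywuzzy's WRatio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def preprocess(text):
    """Assumed cleaner: lowercase, turn runs of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def extract_sketch(query, choices, scorer=simple_ratio, limit=5):
    """Score every choice against the query and return the `limit` best pairs."""
    cleaned_query = preprocess(query)
    scored = (
        (choice, scorer(cleaned_query, preprocess(choice)))
        for choice in choices
    )
    # Highest score first, as (choice, score) tuples.
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


columns = ["PLANNING_TIME_FENCE_CODE", "BUILD_IN_WIP_FLAG", "FULL_LEAD_TIME"]
query = "Total replenishment lead time (in workdays)"

for choice, score in extract_sketch(query, columns, limit=3):
    print(choice, score)
```

The point of the sketch is that the ranking depends entirely on the scorer and on how the strings are cleaned beforehand, so a surprising match usually traces back to one of those two steps.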

Still, you can specify the string comparison function yourself via the scorer argument to extract.

Source for process.extract: https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py
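To see how the choice of scorer changes the ranking for the strings in the question, something like the following could be tried. It assumes fuzzywuzzy is installed (pip install fuzzywuzzy) and falls back to an empty result otherwise. top_matches is a hypothetical helper, and no scores are shown because they depend on the installed version.

```python
try:
    from fuzzywuzzy import fuzz, process
except ImportError:  # fuzzywuzzy is not installed in this environment
    fuzz = process = None

query = "Total replenishment lead time (in workdays)"
choices = ["PLANNING_TIME_FENCE_CODE", "BUILD_IN_WIP_FLAG", "FULL_LEAD_TIME"]


def top_matches(scorer_name, limit=3):
    """Run process.extract with the named fuzz scorer; [] if the library is missing."""
    if process is None:
        return []
    scorer = getattr(fuzz, scorer_name)
    return process.extract(query, choices, scorer=scorer, limit=limit)


# Compare the default scorer (WRatio) with two of the individual ratios.
for name in ("WRatio", "token_set_ratio", "partial_ratio"):
    print(name, top_matches(name))
```

Comparing the three outputs side by side is a quick way to check which scorer ranks "FULL_LEAD_TIME" highest for a given data set.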

answered Sep 29 '22 09:09

Jack Rowntree
