I am learning fuzzywuzzy
in Python.
I understand the concept of fuzz.ratio
, fuzz.partial_ratio
, fuzz.token_sort_ratio
and fuzz.token_set_ratio
. My question is when to use which function?
fuzz.partial_ratio
?fuzz.token_sort_ratio
? fuzz.token_set_ratio
?Anyone knows what criteria SeatGeek uses?
I am trying to build a real estate website, thinking to use fuzzywuzzy
to compare addresses.
FuzzyWuzzy is a library of Python which is used for string matching. Fuzzy string matching is the process of finding strings that match a given pattern. Basically it uses Levenshtein Distance to calculate the differences between sequences.
Often you may want to join together two datasets in pandas based on imperfectly matching strings. This is called fuzzy matching. The easiest way to perform fuzzy matching in pandas is to use the get_close_matches() function from the difflib package.
Great question.
I'm an engineer at SeatGeek, so I think I can help here. We have a great blog post that explains the differences quite well, but I can summarize and offer some insight into how we use the different types.
Under the hood each of the four methods calculate the edit distance between some ordering of the tokens in both input strings. This is done using the difflib.ratio
function which will:
Return a measure of the sequences' similarity (float in [0,1]).
Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1 if the sequences are identical, and 0 if they have nothing in common.
The four fuzzywuzzy methods call difflib.ratio
on different combinations of the input strings.
Simple. Just calls difflib.ratio
on the two input strings (code).
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") > 96
Attempts to account for partial string matches better. Calls ratio
using the shortest string (length n) against all n-length substrings of the larger string and returns the highest score (code).
Notice here that "YANKEES" is the shortest string (length 7), and we run the ratio with "YANKEES" against all substrings of length 7 of "NEW YORK YANKEES" (which would include checking against "YANKEES", a 100% match):
fuzz.ratio("YANKEES", "NEW YORK YANKEES") > 60 fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") > 100
Attempts to account for similar strings out of order. Calls ratio
on both strings after sorting the tokens in each string (code). Notice here fuzz.ratio
and fuzz.partial_ratio
both fail, but once you sort the tokens it's a 100% match:
fuzz.ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") > 45 fuzz.partial_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") > 45 fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") > 100
Attempts to rule out differences in the strings. Calls ratio on three particular substring sets and returns the max (code):
Notice that by splitting up the intersection and remainders of the two strings, we're accounting for both how similar and different the two strings are:
fuzz.ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners") > 36 fuzz.partial_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners") > 61 fuzz.token_sort_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners") > 51 fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners") > 91
This is where the magic happens. At SeatGeek, essentially we create a vector score with each ratio for each data point (venue, event name, etc) and use that to inform programatic decisions of similarity that are specific to our problem domain.
That being said, truth by told it doesn't sound like FuzzyWuzzy is useful for your use case. It will be tremendiously bad at determining if two addresses are similar. Consider two possible addresses for SeatGeek HQ: "235 Park Ave Floor 12" and "235 Park Ave S. Floor 12":
fuzz.ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12") > 93 fuzz.partial_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12") > 85 fuzz.token_sort_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12") > 95 fuzz.token_set_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12") > 100
FuzzyWuzzy gives these strings a high match score, but one address is our actual office near Union Square and the other is on the other side of Grand Central.
For your problem you would be better to use the Google Geocoding API.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With