Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String Matching using fuzzywuzzy- is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm?

fuzzywuzzy is a very popular library for string matching. As per the documentation of the library, it is mentioned that it uses Levenshtein distance for computing the differences between sequences. But upon close inspection, I find that it actually uses the SequenceMatcher function from the difflib library. And this function, as per documentation uses the Ratcliff/Obershelp pattern-matching algorithm.

As per the definitions, Levenshtein distance is the minimum number of edits needed to transform one string into the other, and Ratcliff/Obershelp pattern-matching algorithm computes the doubled number of matching characters divided by the total number of characters in the two strings. A close related post comparing both.

And when I run an example, I get the same result for SequenceMatcher and ratio function in fuzzywuzzy.

from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
s = SequenceMatcher(None, "abcd", "bcde")
s.ratio()
# 0.75
fuzz.ratio("abcd", "bcde")
# 75

If I compute the Levenshtein distance manually between the two strings, I guess it will be just 2. In this case, how does it become that it uses Levenshtein distance as the contributors write in the documentation?

like image 404
prashanth Avatar asked Oct 17 '22 08:10

prashanth


1 Answers

FuzzyWuzzy.ratio using python-Levenshtein doesn't return the Levenshtein score, but rather the Levenshtein ratio, which is (a+b - LevenshteinScore)/(a+b), where a and b are the lengths of the two strings being compared.

If you don't have python-Levenshtein installed then fuzzywuzzy doesn't use Levenshtein at all. Fuzzywuzzy's home page is misleading with regards to this, though it does recommend installing python-Levenshtein.

python-Levenshtein has some issues with installing; I used the second response to this stackoverflow question to solve it.

If you don't have python-Levenshtein installed FuzzyWuzzy uses difflib instead, which is the same for many input values, but not all. The developers recommend using python-Levenshtein. See this issue on fuzzywuzzy's git, which includes an example case where the results are different with the package as compared to without it. This probably shouldn't happen, or at least the documentation should make this clear, but FuzzyWuzzy's Devs seem content at least with the functionality.

like image 86
Isaac Avatar answered Oct 23 '22 10:10

Isaac