Here's a string T:
'men shirt team brienne funny sarcasm shirt features graphic tees mugs babywear much real passion brilliant design detailed illustration strong appreciation things creative br shop thousands designs found across different shirt babywear mugs funny pop culture abstract witty many designs brighten day well day almost anyone else meet ul li quality short sleeve crew neck shirts 100 cotton soft durable comfortable feel fit standard size doubt l xl available li li sustainability label company conceived belief textiles industry start acting lot responsibly made cotton li li clothing printed using state art direct garment equipment crack peel washed li li graphic tee designs professionally printed unique design look great make someone smile funny cute vintage expressive artwork li ul'
The string above is a preprocessed version of the original text and is therefore difficult to read; note that it contains the phrase 'short sleeve'.
I'm getting the following values:
fuzz.partial_ratio('short sleeve', T) gives 50
fuzz.partial_ratio('long sleeve', T) gives 73
fuzz.partial_ratio('dsfsdf sleeve', T) gives 62
fuzz.partial_ratio('sleeve', T) gives 50
I'm very confused by this. Shouldn't the first and fourth values be 100? Surely I'm missing something but I cannot figure it out.
EDIT: Here's another example (with T set to the string below) that I ran after uninstalling the python-Levenshtein library:
'first succeed way wife told v 2 long sleeve shirt id 1084 first succeed way wife told v 2 long sleeve shirt design printed quality 100 long sleeve cotton shirt sports gray 90 cotton 10 polyester standard long sleeve shirts fashion fit tight fitting style please check size chart listed additional image feel free contact us first sizing questions satisfaction 100 guaranteed shirts usually ship business day ordered noon est next business day ordered noon est long sleeve shirts 100 cotton standard shirt fashion fit combined shipping multiple items'
fuzz.partial_ratio('long sleeve', T) gives 27
fuzz.partial_ratio('short sleeve', T) gives 33
fuzz.partial_ratio('sleeveless', T) gives 40
fuzz.partial_ratio('dsfasd sleeve', T) gives 23
Unfortunately, the problem doesn't seem to be exclusive to the python-Levenshtein library.
There is a really strange and subtle bug in the fuzzywuzzy library somewhere.
If we run the following:
from fuzzywuzzy import fuzz
fuzz.partial_ratio('funny', 'aa aaaaa aaaa aaaaaaa funny aaaaaaa aaaaaaaa aaaaaaa aaaa aaaa aaayaaaa auaa aaaa aaaaaaaa aaaaaaaaa aaaaaa aaaaaaaa aaaaa aaaa aa aaaaaaaaaaa aaaaaa aaaffaaaaaaa aaaaa aaayaaaa auaa funny aaaa aaaaaa')
it returns 0
Whereas if we remove a single letter from the start of this string:
fuzz.partial_ratio('funny', 'a aaaaa aaaa aaaaaaa funny aaaaaaa aaaaaaaa aaaaaaa aaaa aaaa aaayaaaa auaa aaaa aaaaaaaa aaaaaaaaa aaaaaa aaaaaaaa aaaaa aaaa aa aaaaaaaaaaa aaaaaa aaaffaaaaaaa aaaaa aaayaaaa auaa funny aaaa aaaaaa')
it returns 100.
(Sorry about the long and horrible strings. I've tried to reduce them to as simple a form as possible, but I can't see the logic driving this bug.)
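Since 'funny' appears verbatim in the long string, the expected score can be confirmed independently with a brute-force sliding window over difflib (a quick sanity check, run here on a shortened version of the string):

```python
from difflib import SequenceMatcher

needle = 'funny'
# Shortened version of the long string above; 'funny' still occurs verbatim.
haystack = 'aa aaaaa aaaa aaaaaaa funny aaaaaaa aaaaaaaa aaaaaaa aaaa aaaa'

# Compare the needle against every window of the same length.
best = max(
    SequenceMatcher(None, needle, haystack[i:i + len(needle)]).ratio()
    for i in range(len(haystack) - len(needle) + 1)
)
print(round(100 * best))  # 100, since one window is exactly 'funny'
```

Any correct partial-match score against this string should therefore be 100.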
There seem to be similar bug reports on GitHub.
Installing python-Levenshtein seemed to fix my example above (fuzzywuzzy reverts to difflib if python-Levenshtein isn't installed), but doesn't change your original example. With python-Levenshtein installed, I can reduce your example to:
fuzz.partial_ratio('sleeve', 's l e e v sleeve e ')
which returns 50.
Removing the first letter from the longer string:
fuzz.partial_ratio('sleeve', 'l e e v sleeve e ')
returns 100.
This provides some hint as to what may be going on, but I suspect it will require a deep dive into python-Levenshtein to figure out.
My recommendation? File a bug report. And then find another library to compare strings. RapidFuzz could be a suitable alternative.
UPDATE:
I think the bug may be related to the use of opcodes from the python-Levenshtein library:
from Levenshtein import opcodes
opcodes('sleeve', 's l e e v sleeve e ')
Returns:
[('equal', 0, 1, 0, 1),
('insert', 1, 1, 1, 2),
('equal', 1, 2, 2, 3),
('insert', 2, 2, 3, 4),
('equal', 2, 3, 4, 5),
('insert', 3, 3, 5, 6),
('equal', 3, 4, 6, 7),
('insert', 4, 4, 7, 8),
('equal', 4, 5, 8, 9),
('insert', 5, 5, 9, 12),
('equal', 5, 6, 12, 13),
('insert', 6, 6, 13, 19)]
When used in fuzzywuzzy, this is clearly not the intended result, even though it is one valid set of minimum edit operations. In fuzzywuzzy, priority should be placed on contiguous blocks, whereas the formal definition of Levenshtein distance doesn't distinguish contiguous from non-contiguous blocks (at least not to my understanding). Note that difflib.SequenceMatcher.get_opcodes() gives a different result.
I suspect some very careful thought is going to be needed to fix this bug and get it right.
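To illustrate the difference, here is what difflib produces for the same pair of strings; it keeps the match 'sleeve' as a single contiguous equal block rather than scattering it across insertions:

```python
from difflib import SequenceMatcher

ops = SequenceMatcher(None, 'sleeve', 's l e e v sleeve e ').get_opcodes()
for op in ops:
    print(op)
# The whole needle aligns to s2[10:16] as one ('equal', 0, 6, 10, 16) block.
```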
The general idea behind the algorithm is to find the best matching substring in a longer string. However, there are a couple of issues with the way this is done in FuzzyWuzzy. In the following description of the algorithm, s1 refers to the shorter string, s2 to the longer string, and s2_substr to a substring of s2.
They implement this algorithm in the following steps:
1. Find the longest common subsequences of s1 in s2.
2. Extract substrings of length s1_len from s2, starting at the positions found in step 1. Such a substring s2_substr can be shorter than s1_len when it is placed at the end of s2.
3. Take all these s2_substr and compare each of them to s1 using a normalized InDel-Distance (like the Levenshtein distance, but without substitutions).
I am aware of the following shortcomings of this implementation:
- It assumes that the optimal s2_substr always starts at the starting point of one of the longest common subsequences. While this is true in many cases, it is not always the case. (You did not run into this issue, and I have not yet seen a bug report on it in FuzzyWuzzy or RapidFuzz; the result only differs a lot in some very specific edge cases that most users probably do not run into often.)
Which algorithm is better suited largely depends on your needs. A first simple solution is to replace FuzzyWuzzy with my library RapidFuzz, which fixes the issues with the LCS algorithm I described. However, it uses the same algorithm as FuzzyWuzzy to calculate the similarity, so the third issue exists as well. I am searching for a better algorithm (for more details, check the following question).
As noted by Andrew Guy, the Smith-Waterman distance could be an alternative as well. However, it has some big differences from fuzz.partial_ratio:
- fuzz.partial_ratio searches for a substring of length s1_len, while the Smith-Waterman algorithm searches for the best aligned substring without caring about its length. This is not bad; you should just be aware of it.
- One disadvantage is that it is harder to normalize the result (bring it to a similarity score between 0 and 100), since the length of the matched substring is not known. This is not really a problem, since you can just search for the lowest distance instead of the highest similarity.
The reason I am not using the Smith-Waterman algorithm in RapidFuzz to calculate fuzz.partial_ratio is that I want it to be a direct replacement for the implementation in FuzzyWuzzy. However, I am planning to add the Smith-Waterman algorithm in the future as well.