 

Levenshtein distance giving strange values

Here's a string T:

'men shirt team brienne funny sarcasm shirt features graphic tees mugs babywear much real passion brilliant design detailed illustration strong appreciation things creative br shop thousands designs found across different shirt babywear mugs funny pop culture abstract witty many designs brighten day well day almost anyone else meet ul li quality short sleeve crew neck shirts 100 cotton soft durable comfortable feel fit standard size doubt l xl available li li sustainability label company conceived belief textiles industry start acting lot responsibly made cotton li li clothing printed using state art direct garment equipment crack peel washed li li graphic tee designs professionally printed unique design look great make someone smile funny cute vintage expressive artwork li ul'

This is a preprocessed version of a longer product description, so it is difficult to read; the relevant part is 'quality short sleeve crew neck shirts', which contains 'short sleeve' verbatim.

I'm getting the following values:

fuzz.partial_ratio('short sleeve', T) gives 50

fuzz.partial_ratio('long sleeve', T) gives 73

fuzz.partial_ratio('dsfsdf sleeve', T) gives 62

fuzz.partial_ratio('sleeve', T) gives 50

I'm very confused by this. Shouldn't the first and fourth values be 100, given that 'short sleeve' (and therefore 'sleeve') appears in T verbatim? Surely I'm missing something, but I cannot figure out what.
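
For reference, here is a brute-force sketch of what I understand partial_ratio is supposed to compute (my own helper, not fuzzywuzzy code): slide a window of len(s1) across T and keep the best normalized similarity. With it, 'short sleeve' scores 100, as expected:

```python
from difflib import SequenceMatcher

def brute_force_partial_ratio(s1, s2):
    # Compare s1 against every window of len(s1) in s2 and keep the best
    # normalized similarity - the value partial_ratio should report.
    best = 0.0
    for i in range(len(s2) - len(s1) + 1):
        best = max(best, SequenceMatcher(None, s1, s2[i:i + len(s1)]).ratio())
    return round(100 * best)

print(brute_force_partial_ratio('short sleeve', T))  # 100: occurs verbatim in T
```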

EDIT: Here's another example, which I ran after uninstalling the python-Levenshtein library, with T now set to:

'first succeed way wife told v 2 long sleeve shirt id 1084 first succeed way wife told v 2 long sleeve shirt design printed quality 100 long sleeve cotton shirt sports gray 90 cotton 10 polyester standard long sleeve shirts fashion fit tight fitting style please check size chart listed additional image feel free contact us first sizing questions satisfaction 100 guaranteed shirts usually ship business day ordered noon est next business day ordered noon est long sleeve shirts 100 cotton standard shirt fashion fit combined shipping multiple items'

fuzz.partial_ratio('long sleeve', T) gives 27

fuzz.partial_ratio('short sleeve', T) gives 33

fuzz.partial_ratio('sleeveless', T) gives 40

fuzz.partial_ratio('dsfasd sleeve', T) gives 23

Unfortunately, the problem doesn't seem to be exclusive to the python-Levenshtein library.

asked Mar 21 '21 by user9343456

2 Answers

There is a really strange and subtle bug in the fuzzywuzzy library somewhere.

If we run the following:

from fuzzywuzzy import fuzz

fuzz.partial_ratio('funny', 'aa aaaaa aaaa aaaaaaa funny aaaaaaa aaaaaaaa aaaaaaa aaaa aaaa aaayaaaa auaa aaaa aaaaaaaa aaaaaaaaa aaaaaa aaaaaaaa aaaaa aaaa aa aaaaaaaaaaa aaaaaa aaaffaaaaaaa aaaaa aaayaaaa auaa funny aaaa aaaaaa')

it returns 0.

Whereas if we remove a single letter from the start of this string:

fuzz.partial_ratio('funny', 'a aaaaa aaaa aaaaaaa funny aaaaaaa aaaaaaaa aaaaaaa aaaa aaaa aaayaaaa auaa aaaa aaaaaaaa aaaaaaaaa aaaaaa aaaaaaaa aaaaa aaaa aa aaaaaaaaaaa aaaaaa aaaffaaaaaaa aaaaa aaayaaaa auaa funny aaaa aaaaaa')

it returns 100.

(Sorry about the long, horrible strings. I've tried to reduce the example to as simple a string as possible, but I can't see the logic driving this bug.)

There seem to be similar bug reports on GitHub.

Installing python-Levenshtein seemed to fix my example above (fuzzywuzzy falls back to difflib if python-Levenshtein isn't installed), but it doesn't change your original example.
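
A quick way to check which backend is active is to see whether the Levenshtein module imports at all, since that is essentially the same check fuzzywuzzy makes internally:

```python
try:
    import Levenshtein  # provided by the python-Levenshtein package
    print('fuzzywuzzy will use python-Levenshtein')
except ImportError:
    print('fuzzywuzzy will fall back to difflib')
```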

With python-Levenshtein installed, I can reduce your example to:

fuzz.partial_ratio('sleeve', 's l e e v sleeve e ')

which returns 50.

Removing the first letter from the longer string:

fuzz.partial_ratio('sleeve', 'l e e v sleeve e ')

returns 100.

This provides some hint as to what may be going on, but I suspect it will take a deep dive into python-Levenshtein to figure out.

My recommendation? File a bug report. And then find another library to compare strings. RapidFuzz could be a suitable alternative.
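
For example, RapidFuzz mirrors the fuzzywuzzy interface, so (assuming it is installed, e.g. via pip install rapidfuzz) it is a drop-in change; with T as defined in the question:

```python
from rapidfuzz import fuzz

# Same call signature as fuzzywuzzy's fuzz.partial_ratio
print(fuzz.partial_ratio('short sleeve', T))  # should be 100: verbatim match
print(fuzz.partial_ratio('sleeve', T))        # should be 100 as well
```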

UPDATE:

I think the bug may be related to the use of opcodes from the python-Levenshtein library:

from Levenshtein import opcodes

opcodes('sleeve', 's l e e v sleeve e ')

Returns:

[('equal', 0, 1, 0, 1),
 ('insert', 1, 1, 1, 2),
 ('equal', 1, 2, 2, 3),
 ('insert', 2, 2, 3, 4),
 ('equal', 2, 3, 4, 5),
 ('insert', 3, 3, 5, 6),
 ('equal', 3, 4, 6, 7),
 ('insert', 4, 4, 7, 8),
 ('equal', 4, 5, 8, 9),
 ('insert', 5, 5, 9, 12),
 ('equal', 5, 6, 12, 13),
 ('insert', 6, 6, 13, 19)]

When used in fuzzywuzzy, this is clearly not the intended result, even though it is a valid set of minimum edit operations: the scattered single-letter matches cost no more than matching the contiguous 'sleeve'. For fuzzywuzzy's purposes, priority should be placed on contiguous blocks, whereas the formal definition of Levenshtein distance doesn't distinguish contiguous from non-contiguous matches (at least not to my understanding). Note that difflib.SequenceMatcher.get_opcodes() gives a different result.
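
For comparison, running the same pair through difflib shows the difference; its recursive longest-match strategy anchors on the contiguous 'sleeve' block rather than scattering the match across single letters:

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(None, 'sleeve', 's l e e v sleeve e ')
print(sm.get_opcodes())
# [('insert', 0, 0, 0, 10), ('equal', 0, 6, 10, 16), ('insert', 6, 6, 16, 19)]
```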

I suspect some very careful thought is going to be needed to fix this bug and get it right.

answered Sep 24 '22 by Andrew Guy


The general idea behind the algorithm is to find the best matching substring in a longer string. However, there are a couple of issues with the way this is done in FuzzyWuzzy. In the following description of the algorithm, s1 refers to the shorter string, s2 to the longer string, and s2_substr to a substring of s2. The algorithm proceeds in the following steps:

  1. They use the Longest Common Subsequence algorithm to find the longest common substrings of s1 in s2.
  2. They use the starting indices of these common subsequences to extract substrings of length s1_len from s2. Such a substring s2_substr can be shorter than s1_len when it starts near the end of s2.
  3. They iterate over these substrings s2_substr and compare each of them to s1 using a normalized InDel distance (like the Levenshtein distance, but without substitutions). A sketch of these steps follows this list.
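
Here is a sketch of those three steps built on difflib; it conveys the idea, not the library's exact code:

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    # Step 1: matching blocks stand in for the longest common substrings
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
    best = 0.0
    for i, j, _ in blocks:
        # Step 2: take a window of length len(shorter) from longer,
        # aligned so that the matching block starts line up
        start = max(0, j - i)
        window = longer[start:start + len(shorter)]
        # Step 3: ratio() is a normalized InDel similarity
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)
```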

I am aware of the following shortcomings of this implementation:

  1. When python-Levenshtein is used, FuzzyWuzzy uses it both to find the Longest Common Subsequences and to calculate the similarity. However, the implementation python-Levenshtein uses to find the Longest Common Subsequence is known to be broken (see here), and I am not aware of a simple fix. Someone proposed a fix, but it only repairs this one case and introduces problems in other cases. (This is the original issue you described.)
  2. When python-Levenshtein is not used, the Longest Common Subsequence is calculated using difflib. However, as described here, FuzzyWuzzy does not disable difflib's autojunk heuristic, which leads to incorrect results when the strings differ greatly in length (see the sketch after this list). I just created a PR to fix this: https://github.com/seatgeek/fuzzywuzzy/pull/303, but the repository is not actively maintained, and SeatGeek appears fine with many of the shortcomings, since the library works well enough for their use case. (This is the issue with difflib you added later.)
  3. The similarity ratio itself is flawed. It assumes that the best matching substring s2_substr always starts at the starting point of one of the longest common subsequences. While this is true in many cases, it is not always the case. (You did not run into this issue, and I have not yet seen a bug report on it in FuzzyWuzzy or RapidFuzz. The result only differs significantly in some very specific edge cases that most users probably do not hit often.)
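
The autojunk issue from point 2 is easy to reproduce with the 'funny' string from the first answer, which appears to be exactly 200 characters long: that is the length at which difflib starts tagging frequent characters in the longer string as junk, and it explains why removing a single letter changed the result. A sketch:

```python
from difflib import SequenceMatcher

s1 = 'funny'
s2 = ('aa aaaaa aaaa aaaaaaa funny aaaaaaa aaaaaaaa aaaaaaa aaaa aaaa '
      'aaayaaaa auaa aaaa aaaaaaaa aaaaaaaaa aaaaaa aaaaaaaa aaaaa aaaa aa '
      'aaaaaaaaaaa aaaaaa aaaffaaaaaaa aaaaa aaayaaaa auaa funny aaaa aaaaaa')
print(len(s2))  # 200, the threshold at which autojunk activates

# Default: every character of 'funny' occurs often enough in s2 to be
# tagged as junk, so no real matching block is found.
print(SequenceMatcher(None, s1, s2).get_matching_blocks())

# With the heuristic disabled, the contiguous 'funny' is found as expected.
print(SequenceMatcher(None, s1, s2, autojunk=False).get_matching_blocks())
```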

Which algorithm is better suited largely depends on your needs. A simple first solution is to replace FuzzyWuzzy with my library RapidFuzz, which fixes the issues with the LCS algorithm that I described. However, it uses the same algorithm as FuzzyWuzzy to calculate the similarity, so the third issue exists there as well. I am searching for a better algorithm (for more details, check the following question). As noted by Andrew Guy, the Smith-Waterman distance could be an alternative as well. However, it has some big differences from fuzz.partial_ratio:

  1. It uses the uniform Levenshtein distance (insertions, deletions, and substitutions all have a weight of 1), while fuzz.partial_ratio uses the InDel distance. If that matters to you, it can probably be adapted by giving substitutions a weight of 2 when implementing it (a sketch follows this list).
  2. fuzz.partial_ratio always takes a substring of length s1_len, while the Smith-Waterman algorithm searches for the best-aligned substring without caring about its length. This is not bad; you should just be aware of it. One disadvantage is that the result is harder to normalize (to bring to a similarity score between 0 and 100), since the length of the substring is not known. This is not really a problem, since you can just search for the lowest distance instead of the highest similarity.
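
To illustrate the first point, here is a minimal Smith-Waterman sketch with the InDel-style weighting (match +1, gap -1, substitution -2, i.e. as expensive as a deletion plus an insertion). It is only an illustration, not RapidFuzz code:

```python
def smith_waterman_indel(s1, s2, match=1, gap=-1):
    # Local alignment score; a substitution costs two gaps (InDel flavor)
    sub = 2 * gap
    prev = [0] * (len(s2) + 1)
    best = 0
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else sub)
            # Local alignment: scores are clamped at 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# The contiguous 'sleeve' scores 6, beating the scattered 's l e e v'
# (5 matches interleaved with penalized gaps).
print(smith_waterman_indel('sleeve', 's l e e v sleeve e '))  # 6
```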

The reason I am not using the Smith-Waterman algorithm in RapidFuzz to calculate fuzz.partial_ratio is that I want it to be a direct replacement for the implementation in FuzzyWuzzy. However, I am planning to add the Smith-Waterman algorithm in the future as well.

answered Sep 25 '22 by maxbachmann