I am a little puzzled by two different answers returned by SequenceMatcher
depending on the order of the arguments. Why is it so?
SequenceMatcher is not commutative:
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo").ratio()
0.6086956521739131
>>> SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm").ratio()
0.5217391304347826
class difflib. SequenceMatcher. This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
Difflib is a Python module that contains several easy-to-use functions and classes that allow users to compare sets of data. The module presents the results of these sequence comparisons in a human-readable format, utilizing deltas to display the differences more cleanly.
The get_close_matches() function returns a list of close matched strings that satisfy the cutoff. The order of close matched string is based on similarity score, so the most similar string comes first in the list.
This module in the python standard library provides classes and functions for comparing sequences like strings, lists etc.
SequenceMatcher.ratio
internally uses SequenceMatcher.get_matching_blocks
to calculate the ratio, I will walk you through the steps to see how that happens:
SequenceMatcher.get_matching_blocks
Return list of triples describing matching subsequences. Each triple is of the form
(i, j, n)
, and means thata[i:i+n] == b[j:j+n]
. The triples are monotonically increasing ini
andj
.The last triple is a dummy, and has the value
(len(a), len(b), 0)
. It is the only triple withn == 0
.If (i, j, n)
and(i', j', n')
are adjacent triples in the list, and the second is not the last triple in the list, theni+n != i'
orj+n != j'
; in other words, adjacent triples always describe non-adjacent equal blocks.
ratio
internally uses SequenceMatcher.get_matching_blocks
's results, and sums the sizes of all matched sequences returned bySequenceMatcher.get_matching_blocks
. This is the exact source code from difflib.py
:
matches = sum(triple[-1] for triple in self.get_matching_blocks())
The above line is critical, because the result of the above expression is used to compute the ratio. We'll see that shortly and how it impacts the calculation of the ratio.
>>> m1 = SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo")
>>> m2 = SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm")
>>> matches1 = sum(triple[-1] for triple in m1.get_matching_blocks())
>>> matches1
7
>>> matches2 = sum(triple[-1] for triple in m2.get_matching_blocks())
>>> matches2
6
As you can see, we have 7 and 6. These are simply the sums of the matched subsequences as returned by get_matching_blocks
. Why does this matter? Here's why, the ratio is computed in the following way, (this is from difflib
source code):
def _calculate_ratio(matches, length):
if length:
return 2.0 * matches / length
return 1.0
length
is len(a) + len(b)
where a
is the first sequence and b
being the second sequence.
Okay, enough talk, we need actions:
>>> length = len("Ebojfm Mzpm") + len("Ebfo ef Mfpo")
>>> m1.ratio()
0.6086956521739131
>>> (2.0 * matches1 / length) == m1.ratio()
True
Similarly for m2
:
>>> 2.0 * matches2 / length
0.5217391304347826
>>> (2.0 * matches2 / length) == m2.ratio()
True
Note: Not all SequenceMatcher(None a,b).ratio() == SequenceMatcher(None b,a).ratio()
are False
, sometimes they can be True
:
>>> s1 = SequenceMatcher(None, "abcd", "bcde").ratio()
>>> s2 = SequenceMatcher(None, "bcde", "abcd").ratio()
>>> s1 == s2
True
In case you're wondering why, this is because
sum(triple[-1] for triple in self.get_matching_blocks())
is the same for both SequenceMatcher(None, "abcd", "bcde")
and SequenceMatcher(None, "bcde", "abcd")
which is 3.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With