In the python difflib library, is the SequenceMatcher class behaving unexpectedly, or am I misreading what the supposed behavior is?
Why does the isjunk argument seem to not make any difference in this case?
difflib.SequenceMatcher(None, "AA", "A A").ratio() return 0.8
difflib.SequenceMatcher(lambda x: x in ' ', "AA", "A A").ratio() returns 0.8
My understanding is that if space is omitted, the ratio should be 1.
This is happening because the ratio
function uses total sequences' length while calculating the ratio, but it doesn't filter elements using isjunk
. So, as long as the number of matches in the matching blocks results in the same value (with and without isjunk
), the ratio measure will be the same.
I assume that sequences are not filtered by isjunk
because of performance reasons.
def ratio(self):
"""Return a measure of the sequences' similarity (float in [0,1]).
Where T is the total number of elements in both sequences, and
M is the number of matches, this is 2.0*M / T.
"""
matches = sum(triple[-1] for triple in self.get_matching_blocks())
return _calculate_ratio(matches, len(self.a) + len(self.b))
self.a
and self.b
are the strings (sequences) passed to the SequenceMatcher object ("AA" and "A A" in your example). The isjunk
function lambda x: x in ' '
is only used to determine the matching blocks. Your example is quite simple, so the resulting ratio and matching blocks are the same for both calls.
difflib.SequenceMatcher(None, "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]
difflib.SequenceMatcher(lambda x: x == ' ', "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]
Same matching blocks, the ratio is: M = 2, T = 6 => ratio = 2.0 * 2 / 6
Now consider the following example:
difflib.SequenceMatcher(None, "AA ", "A A").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=3, size=0)]
difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=3, b=3, size=0)]
Now matching blocks are different, but the ratio will be the same because the number of matches is still equal:
When isjunk
is None: M = 2, T = 6 => ratio = 2.0 * 2 / 6
When isjunk
is lambda x: x == ' '
: M = 1 + 1, T = 6 => ratio = 2.0 * 2 / 6
Finally, a different number of matches:
difflib.SequenceMatcher(None, "AA ", "A A ").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=4, size=0)]
difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A ").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=2), Match(a=3, b=4, size=0)]
The number of matches is different
When isjunk
is None: M = 2, T = 7 => ratio = 2.0 * 2 / 7
When isjunk
is lambda x: x == ' '
: M = 1 + 2, T = 6 => ratio = 2.0 * 3 / 7
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With