Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

difflib.SequenceMatcher isjunk argument not considered?

In the python difflib library, is the SequenceMatcher class behaving unexpectedly, or am I misreading what the supposed behavior is?

Why does the isjunk argument seem to not make any difference in this case?

difflib.SequenceMatcher(None, "AA", "A A").ratio() return 0.8

difflib.SequenceMatcher(lambda x: x in ' ', "AA", "A A").ratio() returns 0.8

My understanding is that if space is omitted, the ratio should be 1.

like image 474
bluelogic Avatar asked Jun 30 '16 17:06

bluelogic


1 Answers

This is happening because the ratio function uses total sequences' length while calculating the ratio, but it doesn't filter elements using isjunk. So, as long as the number of matches in the matching blocks results in the same value (with and without isjunk), the ratio measure will be the same.

I assume that sequences are not filtered by isjunk because of performance reasons.

def ratio(self):   
    """Return a measure of the sequences' similarity (float in [0,1]).

    Where T is the total number of elements in both sequences, and
    M is the number of matches, this is 2.0*M / T.
    """

    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    return _calculate_ratio(matches, len(self.a) + len(self.b))

self.a and self.b are the strings (sequences) passed to the SequenceMatcher object ("AA" and "A A" in your example). The isjunk function lambda x: x in ' ' is only used to determine the matching blocks. Your example is quite simple, so the resulting ratio and matching blocks are the same for both calls.

difflib.SequenceMatcher(None, "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]

Same matching blocks, the ratio is: M = 2, T = 6 => ratio = 2.0 * 2 / 6

Now consider the following example:

difflib.SequenceMatcher(None, "AA ", "A A").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=3, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=3, b=3, size=0)]

Now matching blocks are different, but the ratio will be the same because the number of matches is still equal:

When isjunk is None: M = 2, T = 6 => ratio = 2.0 * 2 / 6

When isjunk is lambda x: x == ' ': M = 1 + 1, T = 6 => ratio = 2.0 * 2 / 6

Finally, a different number of matches:

difflib.SequenceMatcher(None, "AA ", "A A ").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=4, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A ").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=2), Match(a=3, b=4, size=0)]

The number of matches is different

When isjunk is None: M = 2, T = 7 => ratio = 2.0 * 2 / 7

When isjunk is lambda x: x == ' ': M = 1 + 2, T = 6 => ratio = 2.0 * 3 / 7

like image 60
Jose S Avatar answered Oct 04 '22 14:10

Jose S