I want to use difflib.SequenceMatcher
to extract longest common substrings from two strings. I'm not sure whether I found a bug or misunderstood the documentation of find_longest_match
. This is the point that I find confusing:
In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.
(https://docs.python.org/3.5/library/difflib.html#difflib.SequenceMatcher.find_longest_match)
Comparing the strings X this is a test
and this is a test X
, the substring X
is in fact a maximal block: it can't be extended (i.e., it is inclusion-maximal). Also, it is the first such maximal block in text A. But it is certainly not a longest common substring. I strongly suspect this is not what find_longest_match
is supposed to find.
In fact, in this example, find_longest_match
does find a longest common substring:
>>> l = len("X this is a test")
>>> matcher = difflib.SequenceMatcher(None, "X this is a test", "this is a test X")
>>> matcher.find_longest_match(0, l, 0, l)
Match(a=2, b=0, size=14)
However, it seems like with some other strings, I can provoke the "find the first maximal block"-behavious described above (sorry for the long strings, if I shorten them, the example somehow breaks):
>>> s1 = "e-like graph visualization using a spanning tree-driven layout technique with constraints specified by layers and the ordering of groups of nodes within layers. We propose a new method of how the orde"
>>> s2 = "itree graph visualization using a spanning tree-driven layout technique with constraints speci ed by layers and the ordering of groups of nodes within layers. We propose a new method of how the drivin"
>>> matcher = difflib.SequenceMatcher(None, s1, s2)
>>> matcher.find_longest_match(1, 149, 5, 149)
Match(a=1, b=47, size=1)
In this case, it matches the first -
in s1[1]
to the -
in s2[47]
, and that's it. A longest common substring would probably be something starting with graph visualization using…
Did I find a bug, or is there another possible interpretation of the documentation that describes this behavior?
I'm using Python 3.5.2 on Ubuntu.
SequenceMatcher FlowChartratio( ) returns the similarity score ( float in [0,1] ) between input strings. It sums the sizes of all matched sequences returned by function get_matching_blocks and calculates the ratio as: ratio = 2.0*M / T , where M = matches , T = total number of elements in both sequences.
class difflib. SequenceMatcher. This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
This module in the python standard library provides classes and functions for comparing sequences like strings, lists etc.
Okay, I figured it out. In case anybody has the same problem: The SequenceMatcher
has an autojunk
parameter that does weird things:
The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching.
As far as I can tell, the matcher will never find any matches containing any "junk". Not sure why that's useful, but it is enabled by default. That also explains why the example from above breaks when I make the strings shorter. It does however speed up the LCS search considerably.
So, in conclusion: You probably want to pass autojunk=False
in the constructor.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With