Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Difflib's SequenceMatcher does not find Longest Common Substrings

I want to use difflib.SequenceMatcher to extract longest common substrings from two strings. I'm not sure whether I found a bug or misunderstood the documentation of find_longest_match. This is the point that I find confusing:

In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.

(https://docs.python.org/3.5/library/difflib.html#difflib.SequenceMatcher.find_longest_match)

Comparing the strings X this is a test and this is a test X, the substring X is in fact a maximal block: it can't be extended (i.e., it is inclusion-maximal). Also, it is the first such maximal block in text A. But it is certainly not a longest common substring. I strongly suspect this is not what find_longest_match is supposed to find.

In fact, in this example, find_longest_match does find a longest common substring:

>>> l = len("X this is a test")
>>> matcher = difflib.SequenceMatcher(None, "X this is a test", "this is a test X")
>>> matcher.find_longest_match(0, l, 0, l)
Match(a=2, b=0, size=14)

However, it seems like with some other strings, I can provoke the "find the first maximal block"-behavious described above (sorry for the long strings, if I shorten them, the example somehow breaks):

>>> s1 = "e-like graph visualization using a spanning tree-driven layout technique with constraints specified by layers and the ordering of groups of nodes within layers. We propose a new method of how the orde"
>>> s2 = "itree graph visualization using a spanning tree-driven layout technique with constraints speci ed by layers and the ordering of groups of nodes within layers. We propose a new method of how the drivin"
>>> matcher = difflib.SequenceMatcher(None, s1, s2)
>>> matcher.find_longest_match(1, 149, 5, 149)
Match(a=1, b=47, size=1)

In this case, it matches the first - in s1[1] to the - in s2[47], and that's it. A longest common substring would probably be something starting with graph visualization using…

Did I find a bug, or is there another possible interpretation of the documentation that describes this behavior?

I'm using Python 3.5.2 on Ubuntu.

like image 775
Lukas Barth Avatar asked Jun 27 '18 09:06

Lukas Barth


People also ask

How does SequenceMatcher work in Python?

SequenceMatcher FlowChartratio( ) returns the similarity score ( float in [0,1] ) between input strings. It sums the sizes of all matched sequences returned by function get_matching_blocks and calculates the ratio as: ratio = 2.0*M / T , where M = matches , T = total number of elements in both sequences.

What is Difflib SequenceMatcher?

class difflib. SequenceMatcher. This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.

Is Difflib standard Python?

This module in the python standard library provides classes and functions for comparing sequences like strings, lists etc.


1 Answers

Okay, I figured it out. In case anybody has the same problem: The SequenceMatcher has an autojunk parameter that does weird things:

The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching.

As far as I can tell, the matcher will never find any matches containing any "junk". Not sure why that's useful, but it is enabled by default. That also explains why the example from above breaks when I make the strings shorter. It does however speed up the LCS search considerably.

So, in conclusion: You probably want to pass autojunk=False in the constructor.

like image 97
Lukas Barth Avatar answered Oct 21 '22 21:10

Lukas Barth