Python Difflib's SequenceMatcher does not find Longest Common Substrings

Tags:

I want to use difflib.SequenceMatcher to extract longest common substrings from two strings. I'm not sure whether I found a bug or misunderstood the documentation of find_longest_match. This is the point that I find confusing:

In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.

(https://docs.python.org/3.5/library/difflib.html#difflib.SequenceMatcher.find_longest_match)

Comparing the strings X this is a test and this is a test X, the substring X is in fact a maximal block: it can't be extended (i.e., it is inclusion-maximal). Also, it is the first such maximal block in text A. But it is certainly not a longest common substring. I strongly suspect this is not what find_longest_match is supposed to find.

In fact, in this example, find_longest_match does find a longest common substring:

>>> l = len("X this is a test")
>>> matcher = difflib.SequenceMatcher(None, "X this is a test", "this is a test X")
>>> matcher.find_longest_match(0, l, 0, l)
Match(a=2, b=0, size=14)

However, it seems like with some other strings, I can provoke the "find the first maximal block"-behavious described above (sorry for the long strings, if I shorten them, the example somehow breaks):

>>> s1 = "e-like graph visualization using a spanning tree-driven layout technique with constraints specified by layers and the ordering of groups of nodes within layers. We propose a new method of how the orde"
>>> s2 = "itree graph visualization using a spanning tree-driven layout technique with constraints speci ed by layers and the ordering of groups of nodes within layers. We propose a new method of how the drivin"
>>> matcher = difflib.SequenceMatcher(None, s1, s2)
>>> matcher.find_longest_match(1, 149, 5, 149)
Match(a=1, b=47, size=1)

In this case, it matches the first - in s1[1] to the - in s2[47], and that's it. A longest common substring would probably be something starting with graph visualization using…

Did I find a bug, or is there another possible interpretation of the documentation that describes this behavior?

I'm using Python 3.5.2 on Ubuntu.

775

asked Jun 27 '18 09:06

Lukas Barth

1 Answers

Okay, I figured it out. In case anybody has the same problem: The SequenceMatcher has an autojunk parameter that does weird things:

The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching.

As far as I can tell, the matcher will never find any matches containing any "junk". Not sure why that's useful, but it is enabled by default. That also explains why the example from above breaks when I make the strings shorter. It does however speed up the LCS search considerably.

So, in conclusion: You probably want to pass autojunk=False in the constructor.

answered Oct 21 '22 21:10

Lukas Barth

Related questions
                            
                                Using asyncio to run a function at the start (00 seconds) of every minute
                            
                                How to continuously monitor a new mail in outlook and unread mails of a specific folder in python
                            
                                OpenCV MatchTemplate in C# is too slow compared to Python
                            
                                When would the python tracemalloc module allocations statistics not match what's shown in ps or pmap?
                            
                                Keras: How to get layer index when already know layer name?
                            
                                What does the parenthesis after the function mean
                            
                                django - prefetch only the newest record?
                            
                                How to extract rar files inside google colab
                            
                                What is the best way in python to write docstrings for lambda functions?
                            
                                Assign value to specific cell in PySpark dataFrame
                            
                                What does tqdm's total parameter do?
                            
                                Django and Folium integration
                            
                                How to pass additional parameters to handle_client coroutine?
                            
                                How to target data attribute with Scrapy
                            
                                Python3 __pycache__ generating even if PYTHONDONTWRITEBYTECODE=1
                            
                                Scipy sigmoid curve fitting
                            
                                Merge list into sparse list efficiently
                            
                                What is the difference between APIView class and generics.GenericAPIView
                            
                                Auto-build an Mkdocs documentation in Travis CI
                            
                                Django: annotate Sum Case When depending on the status of a field

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Python Difflib's SequenceMatcher does not find Longest Common Substrings

Tags:

python

difflib

longest-substring

Lukas Barth

People also ask

1 Answers

Lukas Barth

Recent Activity

Donate For Us