I've been trying to create a nested or recursive effect with SequenceMatcher.
The final goal is comparing two sequences, both may contain instances of different types.
For example, the sequences could be:
l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]
Normally, SequenceMatcher will identify only [1] as a common sub-sequence for l1 and l2.
I'd like SequnceMatcher to be applied twice for string instances, so that "Foo"
and "Fo"
will be considered equal, as well as "Bar"
and "Bak"
, and the longest common sub-sequence will be of length 3 [1, Foo/Fo, Bar/Bak]
. That is, I'd like SequenceMatcher to be more forgiving when comparing string members.
What I tried doing is write a wrapper for the built-in str class:
from difflib import SequenceMatcher
class myString:
def __init__(self, string):
self.string = string
def __hash__(self):
return hash(self.string)
def __eq__(self, other):
return SequenceMatcher(a=self.string, b=self.string).ratio() > 0.5
Edit: perhaps a more elegant way is:
class myString(str):
def __eq__(self, other):
return SequenceMatcher(a=self, b=other).ratio() > 0.5
By doing this, the following is made possible:
>>> Foo = myString("Foo")
>>> Fo = myString("Fo")
>>> Bar = myString("Bar")
>>> Bak = myString("Bak")
>>> l1 = [1, Foo, Bar, 3]
>>> l2 = [1, Fo, Bak, 2]
>>> SequenceMatcher(a=l1, b=l2).ratio()
0.75
So, evidently it's working, but I have a bad feeling about overriding the hash function. When is the hash used? Where can it come back and bite me?
SequenceMatcher's documentation states the following:
This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
And by definition hashable elements are required to fulfill the following requirement:
Hashable objects which compare equal must have the same hash value.
In addition, do I need to override cmp as well?
I'd love to hear about other solutions that come to mind.
Thanks.
SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching".
This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce information about file differences in various formats, including HTML and context and unified diffs.
SequenceMatcher is a class that is available in the difflib Python package. The difflib module provides classes and functions for comparing sequences. It can be used to compare files and can produce information about file differences in various formats. This class can be used to compare two input sequences or strings.
Difflib is a built-in module in the Python programming language consisting of different simple functions and classes that allow users to compare data sets. The module offers the outputs of these sequence comparisons in a format that can be read by a human, using deltas to show the differences more efficiently.
Your solution isn't bad - you could also look at re-working the SequenceMatcher to recursively apply when elements of a sequence are themselves iterables, with some custom logic. That would be sort of a pain. If you only want this subset of SequenceMatcher's functionality, writing a custom diff tool might not be a bad idea either.
Overriding __hash__
to make "Foo"
and "Fo"
equal will cause collisions in dictionaries (hash tables) and such. If you're literally only interested in the first 2 characters and are set on using SequenceMatcher, returning cls.super(self[2:])
might be the way to go.
All that said, your best bet is probably a one-off diff tool. I can sketch out the basics of something like that if you're interested. You just need to know what the constraints are in the circumstances (does the subsequence always start on the first element, that kind of thing).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With