Difflib's SequenceMatcher - Customized equality

Tags:

I've been trying to create a nested or recursive effect with SequenceMatcher.

The final goal is comparing two sequences, both may contain instances of different types.

For example, the sequences could be:

l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]

Normally, SequenceMatcher will identify only [1] as a common sub-sequence for l1 and l2.

I'd like SequnceMatcher to be applied twice for string instances, so that "Foo" and "Fo" will be considered equal, as well as "Bar" and "Bak", and the longest common sub-sequence will be of length 3 [1, Foo/Fo, Bar/Bak]. That is, I'd like SequenceMatcher to be more forgiving when comparing string members.

What I tried doing is write a wrapper for the built-in str class:

from difflib import SequenceMatcher
class myString:
    def __init__(self, string):
        self.string = string
    def __hash__(self):
        return hash(self.string)
    def __eq__(self, other):
        return SequenceMatcher(a=self.string, b=self.string).ratio() > 0.5

Edit: perhaps a more elegant way is:

class myString(str):
    def __eq__(self, other):
        return SequenceMatcher(a=self, b=other).ratio() > 0.5

By doing this, the following is made possible:

>>> Foo = myString("Foo")
>>> Fo = myString("Fo")
>>> Bar = myString("Bar")
>>> Bak = myString("Bak")
>>> l1 = [1, Foo, Bar, 3]
>>> l2 = [1, Fo, Bak, 2]
>>> SequenceMatcher(a=l1, b=l2).ratio()
0.75

So, evidently it's working, but I have a bad feeling about overriding the hash function. When is the hash used? Where can it come back and bite me?

SequenceMatcher's documentation states the following:

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.

And by definition hashable elements are required to fulfill the following requirement:

Hashable objects which compare equal must have the same hash value.

In addition, do I need to override cmp as well?

I'd love to hear about other solutions that come to mind.

Thanks.

804

asked Sep 07 '13 10:09

YaronK

1 Answers

Your solution isn't bad - you could also look at re-working the SequenceMatcher to recursively apply when elements of a sequence are themselves iterables, with some custom logic. That would be sort of a pain. If you only want this subset of SequenceMatcher's functionality, writing a custom diff tool might not be a bad idea either.

Overriding __hash__ to make "Foo" and "Fo" equal will cause collisions in dictionaries (hash tables) and such. If you're literally only interested in the first 2 characters and are set on using SequenceMatcher, returning cls.super(self[2:]) might be the way to go.

All that said, your best bet is probably a one-off diff tool. I can sketch out the basics of something like that if you're interested. You just need to know what the constraints are in the circumstances (does the subsequence always start on the first element, that kind of thing).

107

answered Sep 22 '22 05:09

a p

Related questions
                            
                                virtualenv doesn't create PYTHONPATH / pip installed packages don't work
                            
                                Python 2.x - creating static instance of class in same class
                            
                                Evaluate SmoothBivariateSpline over a grid
                            
                                Python looping combinations of 8 objects into 3 groups, 3-3-2
                            
                                numpy.all with integer arguments returns an integer
                            
                                Vim: how to install the UltiSnips plugin?
                            
                                Django: how to order_by on a related field of a related field
                            
                                Implementing the Waterman-Eggert algorithm
                            
                                What is the equivalent of a C# Task in python
                            
                                Passing arguments which include double quotes to subprocess using Python
                            
                                Python and Numba for vectorized functions
                            
                                Django: ValueError: Cannot create form field because its related model has not been loaded yet
                            
                                requests python lib adding in Accept header
                            
                                How do I compile multiple py files as one?
                            
                                python2.7 peek at stdin
                            
                                numpy.genfromtxt: Ambiguous delimiters?
                            
                                VTK window not filling wx.Panel
                            
                                How can I call a python script from a python script
                            
                                Multiple sequence comparison of arbitrary string with oriented characters
                            
                                Efficient session variable server-side caching with Python+Flask

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Difflib's SequenceMatcher - Customized equality

Tags:

python

difflib

sequencematcher

YaronK

People also ask

1 Answers

a p

Recent Activity

Donate For Us