Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difflib's SequenceMatcher - Customized equality

I've been trying to create a nested or recursive effect with SequenceMatcher.

The final goal is comparing two sequences, both may contain instances of different types.

For example, the sequences could be:

l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]

Normally, SequenceMatcher will identify only [1] as a common sub-sequence for l1 and l2.

I'd like SequnceMatcher to be applied twice for string instances, so that "Foo" and "Fo" will be considered equal, as well as "Bar" and "Bak", and the longest common sub-sequence will be of length 3 [1, Foo/Fo, Bar/Bak]. That is, I'd like SequenceMatcher to be more forgiving when comparing string members.

What I tried doing is write a wrapper for the built-in str class:

from difflib import SequenceMatcher
class myString:
    def __init__(self, string):
        self.string = string
    def __hash__(self):
        return hash(self.string)
    def __eq__(self, other):
        return SequenceMatcher(a=self.string, b=self.string).ratio() > 0.5

Edit: perhaps a more elegant way is:

class myString(str):
    def __eq__(self, other):
        return SequenceMatcher(a=self, b=other).ratio() > 0.5

By doing this, the following is made possible:

>>> Foo = myString("Foo")
>>> Fo = myString("Fo")
>>> Bar = myString("Bar")
>>> Bak = myString("Bak")
>>> l1 = [1, Foo, Bar, 3]
>>> l2 = [1, Fo, Bak, 2]
>>> SequenceMatcher(a=l1, b=l2).ratio()
0.75

So, evidently it's working, but I have a bad feeling about overriding the hash function. When is the hash used? Where can it come back and bite me?

SequenceMatcher's documentation states the following:

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.

And by definition hashable elements are required to fulfill the following requirement:

Hashable objects which compare equal must have the same hash value.

In addition, do I need to override cmp as well?

I'd love to hear about other solutions that come to mind.

Thanks.

like image 804
YaronK Avatar asked Sep 07 '13 10:09

YaronK


People also ask

What algorithm does SequenceMatcher use?

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching".

What is the use of difflib?

This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce information about file differences in various formats, including HTML and context and unified diffs.

How does Difflib SequenceMatcher work?

SequenceMatcher is a class that is available in the difflib Python package. The difflib module provides classes and functions for comparing sequences. It can be used to compare files and can produce information about file differences in various formats. This class can be used to compare two input sequences or strings.

What is Difflib library in Python?

Difflib is a built-in module in the Python programming language consisting of different simple functions and classes that allow users to compare data sets. The module offers the outputs of these sequence comparisons in a format that can be read by a human, using deltas to show the differences more efficiently.


1 Answers

Your solution isn't bad - you could also look at re-working the SequenceMatcher to recursively apply when elements of a sequence are themselves iterables, with some custom logic. That would be sort of a pain. If you only want this subset of SequenceMatcher's functionality, writing a custom diff tool might not be a bad idea either.

Overriding __hash__ to make "Foo" and "Fo" equal will cause collisions in dictionaries (hash tables) and such. If you're literally only interested in the first 2 characters and are set on using SequenceMatcher, returning cls.super(self[2:]) might be the way to go.

All that said, your best bet is probably a one-off diff tool. I can sketch out the basics of something like that if you're interested. You just need to know what the constraints are in the circumstances (does the subsequence always start on the first element, that kind of thing).

like image 107
a p Avatar answered Sep 22 '22 05:09

a p