
Making difflib's SequenceMatcher ignore "junk" characters

I have a lot of strings that I want to match for similarity (each string is 30 characters on average). I found difflib's SequenceMatcher great for this task, as it is simple and gives good results. But if I compare hellboy and hell-boy like this:

>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher(lambda x: x == '-', 'hellboy', 'hell-boy')
>>> sm.ratio()
0.93333333333333335

I want such words to give a 100 percent match, i.e. a ratio of 1.0. I understand that the junk characters specified in the function above are ignored only when finding the longest contiguous matching subsequence, not in the comparison itself. Is there some way I can make SequenceMatcher ignore some "junk" characters for comparison purposes?

lovesh asked Apr 02 '12 20:04




2 Answers

If you wish to do as I suggested in the comments (removing the junk characters), the fastest method is to use str.translate().

E.g:

# Python 2 byte strings only: the second argument is a string of characters to delete.
to_compare = to_compare.translate(None, "-")

As shown here, this is significantly (about 3x) faster than a regex, and I feel it is nicer to read.

Note that under Python 3.x, or if you are using Unicode under Python 2.x, this will not work because the deletechars parameter is not accepted. In that case, you simply need to map each character to None. E.g.:

translation_map = str.maketrans({"-": None})
to_compare = to_compare.translate(translation_map)

If you have a lot of characters you want to remove, a small function can save some typing. Just make a set of the characters and pass it through:

def to_translation_map(iterable):
    # Key by code point: translate() on Unicode strings looks up ordinals, not characters.
    return {ord(key): None for key in iterable}
    #return dict((ord(key), None) for key in iterable) #For old versions of Python without dict comps.
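
To tie this back to the question, here is a minimal sketch (assuming Python 3) of stripping junk from both strings before handing them to SequenceMatcher. The names JUNK_CHARS, strip_junk and similarity are hypothetical helpers made up for illustration, and the set of junk characters is an assumption:

```python
from difflib import SequenceMatcher

# Assumed junk set for illustration; adjust to your own data.
JUNK_CHARS = "-_"


def strip_junk(text, junk=JUNK_CHARS):
    """Return text with every character in junk removed."""
    # str.translate() deletes any code point that maps to None.
    return text.translate({ord(ch): None for ch in junk})


def similarity(a, b, junk=JUNK_CHARS):
    """SequenceMatcher ratio of the two strings after junk removal."""
    return SequenceMatcher(None, strip_junk(a, junk), strip_junk(b, junk)).ratio()


print(similarity("hellboy", "hell-boy"))  # 1.0
```

Because the junk is removed before the comparison, it no longer counts toward the length used by ratio(), which is what keeps the original score below 1.0.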

Gareth Latty answered Sep 27 '22 02:09


If you want a function that removes all the junk characters beforehand, you could use re:

import re
string = re.sub(r'-|_|\*', '', string)

For the regular expression '-|_|\*', just put a | between all the junk characters, and if one of them is a special re character, put a \ before it (like * and +).
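
If you would rather not escape special characters by hand, re.escape() can do it for you. The following is only a sketch of that idea; junk_pattern and remove_junk are hypothetical names, and the default junk characters are an assumption:

```python
import re


def junk_pattern(junk_chars):
    """Compile an alternation of the given characters, escaped as needed."""
    # re.escape() adds the backslash for characters that are special in a regex.
    return re.compile("|".join(re.escape(ch) for ch in junk_chars))


def remove_junk(text, junk_chars="-_*"):
    """Return text with every junk character removed."""
    return junk_pattern(junk_chars).sub("", text)


print(remove_junk("hell-boy_*2"))  # hellboy2
```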

apple16 answered Sep 27 '22 02:09