I have a lot of strings that I want to match for similarity (each string is 30 characters long on average). I found difflib's SequenceMatcher great for this task, as it was simple and gave good results. But if I compare hellboy and hell-boy like this
>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher(lambda x: x == '-', 'hellboy', 'hell-boy')
>>> sm.ratio()
0.93333333333333335
I want such words to give a 100 percent match, i.e. a ratio of 1.0. I understand that the junk characters specified in the function above are used not for the comparison itself but only when finding the longest contiguous matching subsequence. Is there some way I can make SequenceMatcher ignore some "junk" characters for comparison purposes?
SequenceMatcher is a class in Python's difflib module, which provides classes and functions for comparing sequences. The module can be used to compare files and can produce information about file differences in various formats. SequenceMatcher itself compares two input sequences or strings.
difflib is part of the Python standard library and consists of simple functions and classes for comparing sequences of data. It presents the results of these comparisons in a human-readable format, using deltas to show the differences compactly.
The get_close_matches() function returns a list of the closest matching strings that satisfy the cutoff. The list is ordered by similarity score, so the most similar string comes first.
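As a minimal sketch of the ordering and cutoff behaviour described above (the candidate list is made up for illustration):

```python
from difflib import get_close_matches

# Hypothetical candidate strings, invented for this example.
candidates = ["hell-boy", "hello", "world"]

# n caps the number of results; cutoff is the minimum similarity (0 to 1).
matches = get_close_matches("hellboy", candidates, n=3, cutoff=0.6)

# Candidates scoring below the cutoff are dropped; the rest come back
# ordered from most to least similar.
print(matches)
```

Note that "hell-boy" is still only a close match here, not an exact one, which is the same behaviour the question runs into with ratio().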
The difflib module in the Python standard library provides classes and functions for comparing sequences such as strings and lists.
If you wish to do as I suggested in the comments (removing the junk characters), the fastest method is to use str.translate().
E.g.:
to_compare = to_compare.translate(None, "-")
As shown here, this is significantly (3x) faster than a regex (and, I feel, nicer to read).
Note that under Python 3.x, or if you are using Unicode under Python 2.x, this will not work, as the deletechars parameter is not accepted. In this case, you simply need to make a mapping to None. E.g.:
translation_map = str.maketrans({"-": None})
to_compare = to_compare.translate(translation_map)
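Putting the pieces together for the original question, here is a minimal Python 3 sketch that strips the junk characters from both strings before handing them to SequenceMatcher. The helper name similarity and the choice of junk characters are assumptions for illustration, not part of difflib:

```python
from difflib import SequenceMatcher

# Which characters count as junk is an assumption; adjust as needed.
JUNK_TABLE = str.maketrans({"-": None, "_": None})

def similarity(a, b):
    """Return the SequenceMatcher ratio after deleting junk characters."""
    a_clean = a.translate(JUNK_TABLE)
    b_clean = b.translate(JUNK_TABLE)
    return SequenceMatcher(None, a_clean, b_clean).ratio()

print(similarity("hellboy", "hell-boy"))  # 1.0
```

Because the junk is removed up front, no isjunk function is needed and identical strings (after cleaning) score exactly 1.0.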
You could also write a small function to save some typing if you have a lot of characters you want to remove; just make a set and pass it through:
def to_translation_map(iterable):
    # translate() looks characters up by code point, so key on ord().
    return {ord(key): None for key in iterable}
    #return dict((ord(key), None) for key in iterable)  # For old versions of Python without dict comps.
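For Python 3, an alternative sketch of the same idea uses the three-argument form of str.maketrans, whose third argument lists characters to delete. The function name strip_chars is hypothetical:

```python
def strip_chars(text, junk):
    """Delete every character in `junk` (any iterable of single chars)."""
    # The third argument of str.maketrans maps each listed character to None.
    table = str.maketrans("", "", "".join(junk))
    return text.translate(table)

junk_chars = {"-", "_", "*"}
print(strip_chars("he_ll-b*oy", junk_chars))  # hellboy
```

If the same junk set is applied to many strings, building the table once and reusing it avoids repeating the maketrans call.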
If you were to make a function to remove all the junk characters beforehand, you could use re:
import re
string = re.sub(r'-|_|\*', '', string)
For the regular expression '-|_|\*', just put a | between all the junk characters, and if one of them is a special re character, put a \ before it (like * and +).
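The escaping does not have to be done by hand: re.escape adds the backslashes where needed. A minimal sketch, assuming a hypothetical helper named build_junk_pattern:

```python
import re

def build_junk_pattern(junk_chars):
    """Compile an alternation such as -|_|\\*|\\+ from the given characters."""
    # re.escape backslash-escapes special characters like * and +.
    return re.compile("|".join(re.escape(ch) for ch in junk_chars))

pattern = build_junk_pattern("-_*+")
print(pattern.sub("", "he*ll-b_o+y"))  # hellboy
```

Compiling the pattern once is worthwhile when it is applied to a large number of strings.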