Making difflib's SequenceMatcher ignore "junk" characters

Tags: python, difflib, sequencematcher

I have a lot of strings that I want to compare for similarity (each string is about 30 characters long on average). I found difflib's SequenceMatcher great for this task: it is simple to use and the results are good. But if I compare hellboy and hell-boy like this:

>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher(lambda x: x == '-', 'hellboy', 'hell-boy')
>>> sm.ratio()
0.93333333333333335

I want such words to give a 100 percent match, i.e. a ratio of 1.0. I understand that the junk characters specified by the function above are only taken into account when finding the longest contiguous matching subsequence; they are not excluded from the comparison itself. Is there some way I can make SequenceMatcher ignore some "junk" characters for comparison purposes?

asked Apr 02 '12 20:04

lovesh

2 Answers

If you wish to do as I suggested in the comments (removing the junk characters), the fastest method is to use str.translate().

For example:

to_compare = to_compare.translate(None, "-")  # Python 2 byte strings only; the characters to delete are passed as a string

This is significantly (3x) faster than a regex (and, I feel, nicer to read).

Note that under Python 3.x, or if you are using Unicode strings under Python 2.x, this will not work, as the deletechars parameter is not accepted. In that case, you simply need to map each unwanted character to None. For example:

# Python 3: str.maketrans() converts the one-character keys to code points.
translation_map = str.maketrans({"-": None})
to_compare = to_compare.translate(translation_map)

If you have a lot of characters to remove, a small function can save some typing. Just make a set of the characters, pass it through the function, and hand the result to translate():

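def to_translation_map(iterable):
    # translate() looks characters up by code point, so the keys must be ordinals.
    return {ord(key): None for key in iterable}
    # return dict((ord(key), None) for key in iterable)  # For old versions of Python without dict comprehensions.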
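Putting the pieces together, here is a minimal sketch for Python 3 that strips the junk characters from both strings before handing them to SequenceMatcher. The names JUNK_CHARS, JUNK_TABLE and similarity are only illustrative, and the set of junk characters is an assumption you would adapt to your own data. It uses the three-argument form of str.maketrans(), whose third argument lists the characters to delete.

```python
from difflib import SequenceMatcher

# Hypothetical set of characters to ignore; adjust to your data.
JUNK_CHARS = "-_*"

# With three arguments, str.maketrans() maps every character in the
# third argument to None, so translate() deletes it.
JUNK_TABLE = str.maketrans("", "", JUNK_CHARS)


def similarity(a, b):
    """Return the SequenceMatcher ratio of a and b with junk characters removed."""
    cleaned_a = a.translate(JUNK_TABLE)
    cleaned_b = b.translate(JUNK_TABLE)
    return SequenceMatcher(None, cleaned_a, cleaned_b).ratio()


print(similarity("hellboy", "hell-boy"))  # 1.0
```

Building the table once and reusing it keeps the per-comparison cost low when there are many strings to compare.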

answered Sep 27 '22 02:09

Gareth Latty

If you were to write a function that removes all the junk characters beforehand, you could use re:

import re
string = re.sub(r'-|_|\*', '', string)

For the regular expression '-|_|\*', just put a | between the junk characters, and if a character has a special meaning in regular expressions (like * and +), put a \ before it.
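If you would rather not work out by hand which characters need a backslash, a possible sketch is to let re.escape() do the escaping and build the alternation from a list of junk characters. The names JUNK_CHARS, JUNK_PATTERN and strip_junk are hypothetical, and the junk set is just an example.

```python
import re

# Hypothetical set of characters to remove; adjust to your data.
JUNK_CHARS = "-_*+"

# re.escape() adds a backslash wherever a character is special in a regex,
# so every junk character is matched literally.
JUNK_PATTERN = re.compile("|".join(re.escape(ch) for ch in JUNK_CHARS))


def strip_junk(text):
    """Return text with every junk character removed."""
    return JUNK_PATTERN.sub("", text)


print(strip_junk("hell-boy"))  # hellboy
```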

answered Sep 27 '22 02:09

apple16
