Ignore spaces when comparing strings in Python

Tags: python, string, difflib

I am using Python's difflib module. Whether or not I set the isjunk argument, the calculated ratio is the same. Shouldn't differences in spaces be ignored when isjunk is lambda x: x == " "?

In [193]: difflib.SequenceMatcher(isjunk=lambda x: x == " ", a="a b c", b="a bc").ratio()
Out[193]: 0.8888888888888888

In [194]: difflib.SequenceMatcher(a="a b c", b="a bc").ratio()
Out[194]: 0.8888888888888888

asked May 08 '13 02:05

RNA

2 Answers

isjunk works a little differently than you might think. In general, isjunk merely identifies one or more characters that do not affect the length of a match but that are still included in the total character count. For example, consider the following:

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(lambda x: x in "abcd", " abcd", "abcd abcd").ratio()
0.7142857142857143

The first four characters of the second string ("abcd") are all ignorable, so the second string can be compared to the first string beginning with the space. Starting from the space in both strings, the SequenceMatcher above finds ten matching characters (five in each string) and four non-matching characters (the ignorable first four characters of the second string). Because ratio() is computed as 2*M/T, where M is the number of matched elements and T is the total length of both strings, this gives a ratio of 10/14 (0.7142857142857143).
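The arithmetic above can be reproduced from the matching blocks. This is a minimal sketch using the same two strings; the variable names (`matched`, `total`, `manual_ratio`) are illustrative only.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"
sm = SequenceMatcher(lambda x: x in "abcd", a, b)

# M: number of matched elements, summed over all matching blocks
# (the final zero-length sentinel block contributes nothing).
matched = sum(block.size for block in sm.get_matching_blocks())

# T: total length of both sequences; junk characters are still counted.
total = len(a) + len(b)

manual_ratio = 2.0 * matched / total
print(matched, total)   # 5 14
print(manual_ratio)     # 0.7142857142857143
print(sm.ratio())       # 0.7142857142857143
```

Note that the junk characters never leave the denominator, which is why marking them as junk does not by itself raise the ratio.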

In your case, then, the first string "a b c" matches the second string at indices 0, 1, and 2 (with values "a b"). Index 3 of the first string (" ") does not have a match but is ignored with regard to the length of the match. Since the space is ignored, index 4 ("c") matches index 3 of the second string. Thus 8 of your 9 characters match, giving you a ratio of 8/9 (0.8888888888888888).

If you want spaces ignored entirely, remove them before comparing:

>>> import difflib
>>> a, b = "a b c", "a bc"
>>> c = a.replace(' ', '')
>>> d = b.replace(' ', '')
>>> difflib.SequenceMatcher(a=c, b=d).ratio()
1.0
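If tabs and newlines should be ignored as well as spaces, the same idea can be wrapped in a small helper. This is only a sketch: `ratio_ignoring_whitespace` is a hypothetical name, not part of difflib.

```python
import difflib


def ratio_ignoring_whitespace(a, b):
    """Similarity ratio after deleting every whitespace character."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops spaces, tabs and newlines alike.
    a_stripped = "".join(a.split())
    b_stripped = "".join(b.split())
    return difflib.SequenceMatcher(a=a_stripped, b=b_stripped).ratio()


print(ratio_ignoring_whitespace("a b c", "a bc"))     # 1.0
print(ratio_ignoring_whitespace("a\tb\nc", "abc"))    # 1.0
```

Unlike isjunk, this removes the whitespace from the total length as well, so it cannot lower the ratio.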

answered Oct 18 '22 19:10

Justin O Barber

You can see what it considers to be matching blocks:

>>> difflib.SequenceMatcher(isjunk=lambda x: x == " ", a="a b c", b="a bc").get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=4, b=3, size=1), Match(a=5, b=4, size=0)]

The first two tell you that it matches "a b" to "a b" and "c" to "c". (The last one is a zero-length sentinel that always terminates the list.)
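To read the blocks as text rather than indices, you can slice both strings with each block. A short sketch (the name `matched_pairs` is illustrative):

```python
import difflib

a, b = "a b c", "a bc"
sm = difflib.SequenceMatcher(isjunk=lambda x: x == " ", a=a, b=b)

matched_pairs = [
    (a[m.a:m.a + m.size], b[m.b:m.b + m.size])
    for m in sm.get_matching_blocks()
    if m.size  # skip the zero-length sentinel that ends the list
]
print(matched_pairs)  # [('a b', 'a b'), ('c', 'c')]
```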

The question is why "a b" can be matched. I found the answer to this in the code. First the algorithm finds a bunch of matching blocks by repeatedly calling find_longest_match. What's notable about find_longest_match is that it allows junk characters at either end of a matching block, as its docstring explains:

If isjunk is defined, first the longest matching block is
determined as above, but with the additional restriction that no
junk element appears in the block.  Then that block is extended as
far as possible by matching (only) junk elements on both sides.  So
the resulting block never matches on junk except as identical junk
happens to be adjacent to an "interesting" match.

This means that it first finds "a " and "b" as separate matches: the non-junk "a" is extended to the right over the adjacent space, and "b" is then matched on its own in the remainder.
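The junk-extension rule can be checked directly by calling find_longest_match on the question's strings. This is a minimal sketch; the index arguments are passed explicitly so it also runs on older Python 3 versions, and the variable names are illustrative.

```python
import difflib

a, b = "a b c", "a bc"
with_junk = difflib.SequenceMatcher(lambda x: x == " ", a, b)
no_junk = difflib.SequenceMatcher(None, a, b)

# Longest block over the full strings, with the space treated as junk:
# the non-junk "a" is found first, then extended over the adjacent space.
first = with_junk.find_longest_match(0, len(a), 0, len(b))
print(first)   # Match(a=0, b=0, size=2)

# Longest block in what remains to the right of the first block.
second = with_junk.find_longest_match(
    first.a + first.size, len(a), first.b + first.size, len(b)
)
print(second)  # Match(a=2, b=2, size=1)

# Without junk, the same search finds "a b" in a single step.
plain = no_junk.find_longest_match(0, len(a), 0, len(b))
print(plain)   # Match(a=0, b=0, size=3)
```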

Then, the interesting part: the code does one last check to see if any of the blocks are adjacent, and merges them if they are. See this comment in the code:

    # It's possible that we have adjacent equal blocks in the
    # matching_blocks list now.  Starting with 2.5, this code was added
    # to collapse them.

So basically it's matching "a " and "b", then merging those two adjacent blocks into "a b" and calling that a match, despite the space character being junk.
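The collapsing step can be illustrated with a simplified re-implementation. This is a sketch of the idea described in that comment, not the CPython source; `collapse_adjacent` is a hypothetical name, and the input triples are the (a index, b index, size) blocks that separate find_longest_match calls yield for the question's strings.

```python
def collapse_adjacent(blocks):
    """Merge (i, j, size) blocks that touch in both sequences."""
    merged = []
    for i, j, size in blocks:
        if merged:
            prev_i, prev_j, prev_size = merged[-1]
            # Adjacent means the previous block ends exactly where
            # this one starts, in sequence a and in sequence b.
            if prev_i + prev_size == i and prev_j + prev_size == j:
                merged[-1] = (prev_i, prev_j, prev_size + size)
                continue
        merged.append((i, j, size))
    return merged


# "a " at (0, 0), "b" at (2, 2), and "c" at (4, 3)
print(collapse_adjacent([(0, 0, 2), (2, 2, 1), (4, 3, 1)]))
# [(0, 0, 3), (4, 3, 1)]
```

The "c" block stays separate because the unmatched space in the first string sits between it and the merged "a b" block.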

answered Oct 18 '22 18:10

chappy
