Ignore spaces when comparing strings in Python

I am using the difflib Python package. Whether or not I set the isjunk argument, the calculated ratios are the same. Shouldn't differences in spaces be ignored when isjunk is lambda x: x == " "?

In [193]: difflib.SequenceMatcher(isjunk=lambda x: x == " ", a="a b c", b="a bc").ratio()
Out[193]: 0.8888888888888888

In [194]: difflib.SequenceMatcher(a="a b c", b="a bc").ratio()
Out[194]: 0.8888888888888888
RNA asked May 08 '13 02:05



2 Answers

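isjunk works a little differently than you might think. It does not remove characters from the comparison. It only marks elements that cannot start a matching block: junk elements are still included in the total character count that ratio() divides by, and junk that is identical in both strings and adjacent to a match can still be absorbed into that match. For example, consider the following: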

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(lambda x: x in "abcd", " abcd", "abcd abcd").ratio()
0.7142857142857143

The first four characters of the second string ("abcd") are all ignorable, so the second string can be compared to the first string beginning with the space. Starting with the space in both strings, the above SequenceMatcher finds ten matching characters (five in each string) and four non-matching characters (the ignorable first four characters of the second string). This gives a ratio of 10/14 (0.7142857142857143).

In your case, then, the first string "a b c" matches the second string at indices 0, 1, and 2 (with values "a b"). Index 3 of the first string (" ") has no counterpart in the second string, so it is skipped, and index 4 ("c") then matches index 3 of the second string. Thus 8 of the 9 characters in total match (4 in each string), giving a ratio of 8/9 (0.8888888888888888).
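The arithmetic in both examples can be checked with a small sketch. It assumes that ratio() is 2*M/T, where M is the total size of the matching blocks and T is the combined length of both strings; the helper name ratio_from_blocks is made up for illustration and is not part of difflib.

```python
from difflib import SequenceMatcher

def ratio_from_blocks(matcher):
    """Recompute the similarity as 2*M/T from the matching blocks."""
    # M: total size of all matching blocks (the final sentinel adds 0).
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both sequences, junk characters included.
    total = len(matcher.a) + len(matcher.b)
    return 2.0 * matched / total if total else 1.0

example = SequenceMatcher(lambda x: x in "abcd", " abcd", "abcd abcd")
question = SequenceMatcher(lambda x: x == " ", "a b c", "a bc")

print(ratio_from_blocks(example), example.ratio())    # both equal 10/14
print(ratio_from_blocks(question), question.ratio())  # both equal 8/9
```

Because T counts every character, junk or not, marking spaces as junk cannot raise the ratio on its own.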

If you want spaces to be ignored entirely, remove them before comparing:

>>> import difflib
>>> a, b = "a b c", "a bc"
>>> c = a.replace(' ', '')
>>> d = b.replace(' ', '')
>>> difflib.SequenceMatcher(a=c, b=d).ratio()
1.0
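
If tabs and newlines should be ignored as well, the same idea can be wrapped in a helper. This is only a sketch: the function name ratio_ignoring_whitespace is hypothetical, and it assumes that dropping every whitespace character (not just " ") is what you want.

```python
import difflib

def ratio_ignoring_whitespace(a, b):
    """Similarity ratio of two strings after removing all whitespace."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs, and newlines alike.
    a_stripped = "".join(a.split())
    b_stripped = "".join(b.split())
    return difflib.SequenceMatcher(a=a_stripped, b=b_stripped).ratio()

print(ratio_ignoring_whitespace("a b c", "a bc"))     # 1.0
print(ratio_ignoring_whitespace("a b c", "a\tb d"))   # compares "abc" with "abd"
```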
Justin O Barber answered Oct 18 '22 19:10


You can see what it considers to be matching blocks:

>>> difflib.SequenceMatcher(isjunk=lambda x: x == " ", a="a b c", b="a bc").get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=4, b=3, size=1), Match(a=5, b=4, size=0)]

The first two tell you that it matches "a b" to "a b" and "c" to "c". (The last one is a zero-length sentinel that always terminates the list.)
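
To read the blocks more easily, each one can be turned back into the substrings it covers. A minimal sketch, using only get_matching_blocks and ordinary slicing; the variable names are chosen for illustration.

```python
import difflib

a, b = "a b c", "a bc"
matcher = difflib.SequenceMatcher(isjunk=lambda x: x == " ", a=a, b=b)

pieces = []
for block in matcher.get_matching_blocks():
    # Each block means a[block.a : block.a + size] == b[block.b : block.b + size].
    a_part = a[block.a:block.a + block.size]
    b_part = b[block.b:block.b + block.size]
    pieces.append((a_part, b_part))
    print(block, repr(a_part), repr(b_part))
```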

The question is why "a b" can be matched even though it contains a junk space. I found the answer to this in the code. First the algorithm finds a bunch of matching blocks by repeatedly calling find_longest_match. What's notable about find_longest_match is that it allows junk characters at the ends of a matching block, as its docstring explains:

If isjunk is defined, first the longest matching block is
determined as above, but with the additional restriction that no
junk element appears in the block.  Then that block is extended as
far as possible by matching (only) junk elements on both sides.  So
the resulting block never matches on junk except as identical junk
happens to be adjacent to an "interesting" match.

This means that it first finds "a " (the junk space is allowed on the end of the "a" match) and then, in the remainder of the strings, "b". The space before "b" cannot be absorbed into the second block because it already belongs to the first one.

Then, the interesting part: the code does one last check to see if any of the blocks are adjacent, and merges them if they are. See this comment in the code:

    # It's possible that we have adjacent equal blocks in the
    # matching_blocks list now.  Starting with 2.5, this code was added
    # to collapse them.

So basically it's matching "a " and "b", then merging those two blocks into "a b" and calling that a match, despite the space character being junk.
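
The individual blocks can be inspected before they are collapsed by calling find_longest_match directly. This is a sketch: the second call searches only the part of each string to the right of the first block, which is an assumption about the order in which get_matching_blocks explores the strings, worked out by hand for this input.

```python
import difflib

a, b = "a b c", "a bc"
matcher = difflib.SequenceMatcher(isjunk=lambda x: x == " ", a=a, b=b)

# Longest block over the full strings: "a" extended by the junk space after it.
first = matcher.find_longest_match(0, len(a), 0, len(b))
# Longest block in what remains to the right of the first one.
second = matcher.find_longest_match(first.a + first.size, len(a),
                                    first.b + first.size, len(b))

for m in (first, second):
    print(m, repr(a[m.a:m.a + m.size]))

# The two blocks touch in both strings, so get_matching_blocks()
# reports them as a single block of their combined size.
merged = matcher.get_matching_blocks()[0]
print(merged)
```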

chappy answered Oct 18 '22 18:10