I am using the difflib Python package. Whether or not I set the isjunk argument, the calculated ratios are the same. Shouldn't differences in spaces be ignored when isjunk is lambda x: x == " "?
In [193]: difflib.SequenceMatcher(isjunk=lambda x: x == " ", a="a b c", b="a bc").ratio()
Out[193]: 0.8888888888888888
In [194]: difflib.SequenceMatcher(a="a b c", b="a bc").ratio()
Out[194]: 0.8888888888888888
isjunk works a little differently than you might think. In general, isjunk merely identifies one or more characters that do not affect the length of a match but that are still included in the total character count. For example, consider the following:
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(lambda x: x in "abcd", " abcd", "abcd abcd").ratio()
0.7142857142857143
The first four characters of the second string ("abcd") are all ignorable, so the second string can be compared to the first string beginning with the space. Starting with the space in both the first string and the second string, then, the above SequenceMatcher finds ten matching characters (five in each string) and four non-matching characters (the ignorable first four characters in the second string). This gives you a ratio of 10/14 (0.7142857142857143).
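The arithmetic above can be checked by hand. The following is a minimal sketch, assuming the ratio is 2 * M / T, where M is the number of matched elements and T is the combined length of both strings (the variable names `matched` and `total` are only for illustration):

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"
sm = SequenceMatcher(lambda x: x in "abcd", a, b)

# Sum the sizes of all matching blocks; the final block is a
# zero-length sentinel, so it adds nothing to the sum.
matched = sum(block.size for block in sm.get_matching_blocks())
total = len(a) + len(b)

print(matched, total)        # matched elements and combined length
print(2 * matched / total)   # ratio computed by hand
print(sm.ratio())            # ratio reported by difflib
```

Both printed ratios should agree, which shows that the junk characters still count toward the total length in the denominator.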
In your case, then, the first string "a b c" matches the second string at indices 0, 1, and 2 (with values "a b"). Index 3 of the first string (" ") does not have a match but is ignored with regard to the length of the match. Since the space is ignored, index 4 ("c") matches index 3 of the second string. Thus 8 of your 9 characters match (four in each string), giving you a ratio of 8/9 (0.8888888888888888).
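To confirm that isjunk makes no difference for these two strings, one can compare the matching blocks with and without it. This is only a small check on the strings from the question, not a general statement about isjunk:

```python
from difflib import SequenceMatcher

a, b = "a b c", "a bc"

with_junk = SequenceMatcher(lambda x: x == " ", a, b)
without_junk = SequenceMatcher(None, a, b)

# Print the matching blocks and the resulting ratio for each matcher.
for sm in (with_junk, without_junk):
    blocks = sm.get_matching_blocks()
    matched = sum(block.size for block in blocks)
    print(blocks, matched, len(a) + len(b), sm.ratio())
```

Both matchers report the same blocks here, so both arrive at the same ratio.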
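If you want spaces to be ignored entirely, you might want to remove them before comparing:
>>> import difflib
>>> a, b = "a b c", "a bc"
>>> c = a.replace(' ', '')
>>> d = b.replace(' ', '')
>>> difflib.SequenceMatcher(a=c, b=d).ratio()
1.0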
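The same idea can be wrapped in a small helper. This is a sketch only: `ratio_ignoring` and its `ignore` parameter are hypothetical names, not part of difflib, and it assumes you really do want the ignored characters removed from both inputs before the comparison.

```python
from difflib import SequenceMatcher

def ratio_ignoring(a, b, ignore=lambda ch: ch == " "):
    """Similarity ratio after dropping every character for which ignore() is true."""
    a_kept = "".join(ch for ch in a if not ignore(ch))
    b_kept = "".join(ch for ch in b if not ignore(ch))
    return SequenceMatcher(None, a_kept, b_kept).ratio()

print(ratio_ignoring("a b c", "a bc"))
# Passing str.isspace drops tabs and newlines as well as spaces.
print(ratio_ignoring("a\tb c", "a bc", ignore=str.isspace))
```

Because the ignored characters are removed up front, they no longer count toward the total length, which is the behaviour the question was expecting from isjunk.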
You can see what it considers to be matching blocks:
>>> difflib.SequenceMatcher(isjunk=lambda x: x == " ", a="a b c", b="a bc").get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=4, b=3, size=1), Match(a=5, b=4, size=0)]
The first two tell you that it matches "a b" to "a b" and "c" to "c". (The last one is a zero-length sentinel that always ends the list.)
The question is why "a b" can be matched even though it contains a junk character. I found the answer to this in the code. First the algorithm finds a bunch of matching blocks by repeatedly calling find_longest_match. What's notable about find_longest_match is that it allows junk characters at either end of a matching block, as its docstring explains:
If isjunk is defined, first the longest matching block is
determined as above, but with the additional restriction that no
junk element appears in the block. Then that block is extended as
far as possible by matching (only) junk elements on both sides. So
the resulting block never matches on junk except as identical junk
happens to be adjacent to an "interesting" match.
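The behaviour described in that docstring can be observed on the strings from the question. A minimal sketch; the bounds are passed explicitly because the default arguments of find_longest_match were only added in Python 3.9:

```python
from difflib import SequenceMatcher

a, b = "a b c", "a bc"
sm = SequenceMatcher(lambda x: x == " ", a, b)

# Longest match over the full range of both strings.
first = sm.find_longest_match(0, len(a), 0, len(b))
print(first)

# The matched text in each string, shown with repr() so the
# trailing junk space is visible.
print(repr(a[first.a:first.a + first.size]))
print(repr(b[first.b:first.b + first.size]))
```

The block starts on the non-junk "a" and is then extended over the identical junk space next to it, so the match has size 2 rather than 1.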
This means that on your input the first block it finds is "a " (the non-junk "a", extended over the junk space that follows it in both strings). The searches to the right of that block then find "b" and "c" as separate blocks.
Then, the interesting part: the code does one last check to see if any of the blocks are adjacent, and merges them if they are. See this comment in the code:
# It's possible that we have adjacent equal blocks in the
# matching_blocks list now. Starting with 2.5, this code was added
# to collapse them.
So basically it's matching "a " and "b", then merging those two adjacent blocks into "a b" and calling that a match, despite the space character being junk.
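To see the blocks before they are collapsed, one can imitate the divide-and-conquer search with the public find_longest_match method. This is a rough sketch, not difflib's actual implementation: `raw_blocks` is a hypothetical helper that recurses to the left and right of each longest match and skips the final merging step.

```python
from difflib import SequenceMatcher

def raw_blocks(sm, alo, ahi, blo, bhi):
    """Collect matching blocks recursively, without merging adjacent ones."""
    i, j, k = sm.find_longest_match(alo, ahi, blo, bhi)
    if k == 0:
        return []
    left = raw_blocks(sm, alo, i, blo, j)            # search before the match
    right = raw_blocks(sm, i + k, ahi, j + k, bhi)   # search after the match
    return left + [(i, j, k)] + right

a, b = "a b c", "a bc"
sm = SequenceMatcher(lambda x: x == " ", a, b)

uncollapsed = raw_blocks(sm, 0, len(a), 0, len(b))
print(uncollapsed)
print([a[i:i + k] for i, _, k in uncollapsed])

# For comparison: the merged result, without the trailing sentinel.
print([tuple(m) for m in sm.get_matching_blocks()[:-1]])
```

On these inputs the sketch yields three blocks ("a ", "b" and "c"), whereas get_matching_blocks reports two, because the first two are adjacent in both strings and get merged into "a b".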