I have a string and three patterns that I want to match against it, using Python's re package. Specifically, if any one of the patterns is found, the output should be "Dislikes"; otherwise, it should be "Likes". Brief info about the three patterns:
pattern 1: check whether all characters in the string are uppercase letters
pattern 2: check whether two consecutive characters are the same, for example AA, BB, ...
pattern 3: check whether the pattern XYXY exists; X and Y can be the same, and the letters in this pattern do not need to be next to each other.
When I write the patterns separately, the program runs as expected. But when I combine the three patterns using alternation (|), the result is wrong. I have checked other Stack Overflow posts, for example here and here, but the solutions provided there do not work for me.
Here is the original code that works fine:
import sys
import re

if __name__ == "__main__":
    pattern1 = re.compile(r"[^A-Z]+")
    pattern2 = re.compile(r"([A-Z])\1")
    pattern3 = re.compile(r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2")
    word = sys.stdin.readline()
    word = word.rstrip('\n')
    if pattern1.search(word) or pattern2.search(word) or pattern3.search(word):
        print("Dislikes")
    else:
        print("Likes")
If I combine the three patterns into one using the following code, something is wrong:
import sys
import re

if __name__ == "__main__":
    pattern = r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2|([A-Z])\1|[^A-Z]+"
    word = sys.stdin.readline()
    word = word.rstrip('\n')
    if re.search(word, pattern):
        print("Dislikes")
    else:
        print("Likes")
If we call the three patterns p1, p2, and p3, I also tried the following combinations:
pattern = r"(p1|p2|p3)"
pattern = r"(p1)|(p2)|(p3)"
But they do not work as expected either. What is the correct way to combine them?
Test words that should output "Likes": ABC, ABCD, A, ABCBA
Test words that should output "Dislikes": ABBC (pattern2), THETXH (pattern3), ABACADA (pattern3), AbCD (pattern1)
Here is a single pattern that joins yours:
([^A-Z]+|([A-Z])\2|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4)
So, why does it work?
It consists of a simple (p1|p2|p3) pattern, where p1, p2 and p3 are those you defined before:
[^A-Z]+
([A-Z])\1
([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2
It can be decomposed as:
(
[^A-Z]+
|([A-Z])\2
|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4
)
The problem you were encountering is the numbering of the groups.
First off, when you combine p2 and p3, both refer to \1, but it stands for a different group in each of the two patterns.
Therefore, p3 should become ...\2...\3, since there is now an additional group before it.
Furthermore, the groups referred to by \number are numbered in the order in which their parentheses are opened.
As a consequence, the very first parenthesis, which opens the outer (...|...|...), counts as the first group, and \1 will refer to it.
Of course, this is not what you want.
In addition, this gives you an error, because \1 then refers to a group that has not been closed yet and is thus not yet defined.
Therefore, the indices should be shifted by one, becoming \2, \3 and \4.
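To make the numbering concrete, here is a minimal sketch that inspects the groups of the combined pattern and shows the compile-time error mentioned above. The variable names (combined, m, compiled) are only illustrative.

```python
import re

# Groups are numbered by the position of their opening parenthesis.
combined = re.compile(r"([^A-Z]+|([A-Z])\2|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4)")
print(combined.groups)  # number of capturing groups in the pattern

m = combined.search("ABBC")
# Group 1 is the outer wrapper and group 2 is the letter captured by p2;
# groups 3 and 4 belong to p3, which is not the alternative that matched here.
print(m.groups())

# Referring to the outer group from inside itself is rejected at compile time.
try:
    re.compile(r"([^A-Z]+|([A-Z])\1)")
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)
```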
Such A|B regexes are usually nested in parentheses, but the outer ones could actually be dropped, and the indices shifted back by one:
[^A-Z]+|([A-Z])\1|([A-Z])[A-Z]*([A-Z])[A-Z]*\2[A-Z]*\3
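As a quick check, the version without outer parentheses can be run against the test words from the question. This is only a sketch: classify, COMBINED, liked and disliked are names made up for the example. Note also that the module-level function is re.search(pattern, string), with the pattern first, whereas the combined snippet in the question passes the two arguments the other way round.

```python
import re

# p1 | p2 | p3 without the outer parentheses, so the groups are numbered 1 to 3.
COMBINED = re.compile(r"[^A-Z]+|([A-Z])\1|([A-Z])[A-Z]*([A-Z])[A-Z]*\2[A-Z]*\3")


def classify(word):
    """Return "Dislikes" if any of the three patterns is found, else "Likes"."""
    # A compiled pattern's search() only takes the string to scan;
    # the module-level form would be re.search(pattern, word).
    return "Dislikes" if COMBINED.search(word) else "Likes"


liked = ["ABC", "ABCD", "A", "ABCBA"]
disliked = ["ABBC", "THETXH", "ABACADA", "AbCD"]
for word in liked + disliked:
    print(word, classify(word))
```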
Here is a small demonstration of this pattern:
import re

if __name__ == "__main__":
    pattern1 = re.compile(r"[^A-Z]+")
    pattern2 = re.compile(r"([A-Z])\1")
    pattern3 = re.compile(r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2")
    # Combined pattern: the outer group is \1, so p2 uses \2 and p3 uses \3 and \4.
    pattern = re.compile(r"([^A-Z]+|([A-Z])\2|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4)")
    while True:
        try:
            word = input("> ")
        except EOFError:
            # Stop at end of input instead of looping forever.
            break
        # Result of each separate pattern, then of the combined one.
        print(pattern1.search(word))
        print(pattern2.search(word))
        print(pattern3.search(word))
        print(pattern.search(word))
Interactive session:
> ABC # Matches no pattern
None
None
None
None
> ABCBA # Matches no pattern
None
None
None
None
> ABBC # Matches p2
None
<_sre.SRE_Match object; span=(1, 3), match='BB'> # p2 is matched
None
<_sre.SRE_Match object; span=(1, 3), match='BB'> # The combined pattern gives the same match
> ABACADA # Matches p3
None
None
<_sre.SRE_Match object; span=(0, 7), match='ABACADA'> # p3 is matched
<_sre.SRE_Match object; span=(0, 7), match='ABACADA'> # The combined pattern gives the same match
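If renumbering by hand feels error-prone, one alternative (not the approach used above) is to give the groups names, so that each backreference stays tied to its own sub-pattern when the parts are joined. The following is a sketch under that assumption; the group names dup, x and y and the variables P1, P2, P3 and joined are invented for the example.

```python
import re

# Named groups keep each backreference tied to its own sub-pattern,
# so nothing needs renumbering when the parts are joined.
P1 = r"[^A-Z]+"
P2 = r"(?P<dup>[A-Z])(?P=dup)"
P3 = r"(?P<x>[A-Z])[A-Z]*(?P<y>[A-Z])[A-Z]*(?P=x)[A-Z]*(?P=y)"

# Non-capturing wrappers (?:...) isolate each alternative without adding groups.
joined = re.compile("|".join(f"(?:{p})" for p in (P1, P2, P3)))

for word in ["ABC", "ABCBA", "ABBC", "THETXH", "ABACADA", "AbCD"]:
    print(word, bool(joined.search(word)))
```

Group names must be unique within one pattern, which is why p2 and p3 use different names here.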