python regular expression: match either one of several regular expressions

Tags: python, regex

I have a string and three patterns that I want to match against it, using Python's re module. If any one of the patterns is found, the program should output "Dislikes"; otherwise it should output "Likes". The three patterns are:

pattern 1: check whether the string contains any character that is not an uppercase letter

pattern 2: check whether two consecutive characters are the same, for example AA, BB, ...

pattern 3: check whether the pattern XYXY exists, where X and Y can be the same letter and the letters in the pattern do not need to be next to each other

When I write the patterns separately, the program runs as expected. But when I combine the three patterns using alternation (|), the result is wrong. I have checked other Stack Overflow posts, for example here and here, but the solutions provided there do not work for me.

Here is the original code that works fine:

import sys
import re

if __name__ == "__main__":
    pattern1 = re.compile(r"[^A-Z]+")
    pattern2 = re.compile(r"([A-Z])\1")
    pattern3 = re.compile(r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2")

    word = sys.stdin.readline()
    word = word.rstrip('\n')
    if pattern1.search(word) or pattern2.search(word) or pattern3.search(word):
        print("Dislikes")
    else:
        print("Likes")

If I combine the three patterns into one using the following code, something goes wrong:

import sys
import re

if __name__ == "__main__":

    pattern = r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2|([A-Z])\1|[^A-Z]+"

    word = sys.stdin.readline()

    word = word.rstrip('\n')
    if re.search(word, pattern):
        print("Dislikes")
    else:
        print("Likes")

If we call the three patterns p1, p2, and p3, I also tried the following combinations:

pattern = r"(p1|p2|p3)"
pattern = r"(p1)|(p2)|(p3)"

But they do not work as expected either. What is the correct way to combine them?

Test cases:

  • "Likes": ABC, ABCD, A, ABCBA
  • "Dislikes": ABBC (pattern 2), THETXH (pattern 3), ABACADA (pattern 3), AbCD (pattern 1)
jdhao, asked Sep 04 '17 14:09

1 Answer

Here is a single pattern that joins yours:

([^A-Z]+|([A-Z])\2|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4)

So, why does it work?

It consists of a simple (p1|p2|p3) pattern, where p1, p2 and p3 are those you defined before:

[^A-Z]+
([A-Z])\1
([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2

It can be decomposed as:

(
  [^A-Z]+
 |([A-Z])\2
 |([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4
)

The problem you were encountering is the numbering of the groups.

First, when you combine p2 and p3, both refer to \1, but \1 means something different in each pattern. Since p2's group now comes before p3's groups, the backreferences in p3 must become ...\2...\3.

Furthermore, the group indices referred to by \number are assigned in the order in which the groups are opened. As a consequence, the opening parenthesis of the outer (...|...|...) counts as the first group, and \1 refers to it. This is not what you want, and it also raises an error, because \1 then refers to a group that has not been closed yet and is therefore not yet defined.

Therefore, the indices must be shifted by one more, becoming \2, \3 and \4.
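
To see the numbering rule in action, here is a minimal sketch. It is not part of the demonstration code further down, and the names combined and open_group_error are only illustrative. It prints how many groups the parenthesized pattern contains, which groups take part in a match on ABBC, and the error message the re module produces when a backreference points at a group that is still open.

```python
import re

# Groups are numbered by the position of their opening parenthesis:
# 1 = outer (...|...|...), 2 = p2's letter, 3 and 4 = p3's X and Y.
combined = re.compile(r"([^A-Z]+|([A-Z])\2|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4)")
print(combined.groups)  # number of capturing groups in the pattern

m = combined.search("ABBC")
print(m.group(1), m.group(2))  # text of the outer group, then p2's letter

# Here \1 sits inside group 1, which is still open at that point,
# so the pattern is rejected when it is compiled.
try:
    re.compile(r"([^A-Z]+|([A-Z])\1)")
    open_group_error = None
except re.error as exc:
    open_group_error = str(exc)
print(open_group_error)
```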

Alternations such as A|B are often wrapped in parentheses, but here the outer ones can be dropped, with the indices shifted back by one:

[^A-Z]+|([A-Z])\1|([A-Z])[A-Z]*([A-Z])[A-Z]*\2[A-Z]*\3
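
As a quick check, the sketch below runs this unparenthesized pattern against the test cases from the question. The helper name verdict is hypothetical. Note also that the module-level call is re.search(pattern, string), with the pattern first; the combined attempt in the question passes the arguments in the opposite order, which by itself would give wrong results.

```python
import re

# The combined pattern without the outer parentheses.
PATTERN = re.compile(r"[^A-Z]+|([A-Z])\1|([A-Z])[A-Z]*([A-Z])[A-Z]*\2[A-Z]*\3")


def verdict(word):
    """Return "Dislikes" if any of the three sub-patterns is found in word."""
    # A compiled pattern only needs the string; the module-level
    # equivalent would be re.search(pattern, word), pattern first.
    return "Dislikes" if PATTERN.search(word) else "Likes"


for word in ["ABC", "ABCD", "A", "ABCBA", "ABBC", "THETXH", "ABACADA", "AbCD"]:
    print(word, verdict(word))
```

Under these assumptions, the first four words should print Likes and the last four Dislikes, in line with the test cases listed in the question.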

Here is a small demonstration of this pattern:

import re

if __name__ == "__main__":
    pattern1 = re.compile(r"[^A-Z]+")
    pattern2 = re.compile(r"([A-Z])\1")
    pattern3 = re.compile(r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2")
    # p1|p2|p3 wrapped in an outer group, so the backreferences shift to \2, \3, \4
    pattern = re.compile(r"([^A-Z]+|([A-Z])\2|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4)")

    while True:
        try:
            word = input("> ")
        except EOFError:
            break  # stop at end of input instead of looping forever
        # Print the result of each separate pattern, then of the combined one
        print(pattern1.search(word))
        print(pattern2.search(word))
        print(pattern3.search(word))
        print(pattern.search(word))

Interactive session:

> ABC    # Matches no pattern
None
None
None
None

> ABCBA  # Matches no pattern
None
None
None
None

> ABBC   # Matches p2
None
<_sre.SRE_Match object; span=(1, 3), match='BB'> # p2 is matched
None
<_sre.SRE_Match object; span=(1, 3), match='BB'> # The combined pattern gives the same match

> ABACADA # Matches p3
None
None
<_sre.SRE_Match object; span=(0, 7), match='ABACADA'> # p3 is matched
<_sre.SRE_Match object; span=(0, 7), match='ABACADA'> # The combined pattern gives the same match
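
If renumbering backreferences by hand feels error-prone, one alternative is Python's named groups, written (?P<name>...) with (?P=name) as the backreference. This is a different technique from the numbered groups used above and is offered only as a sketch; the group names dup, x and y are made up for illustration.

```python
import re

# Each sub-pattern refers to its own named groups, so joining the
# alternatives does not require shifting any indices.
NAMED = re.compile(
    r"[^A-Z]+"                                                   # p1
    r"|(?P<dup>[A-Z])(?P=dup)"                                   # p2
    r"|(?P<x>[A-Z])[A-Z]*(?P<y>[A-Z])[A-Z]*(?P=x)[A-Z]*(?P=y)"   # p3
)

for word in ["ABC", "ABCBA", "ABBC", "ABACADA", "AbCD"]:
    print(word, NAMED.search(word))
```

The alternatives can then be reordered or wrapped in extra parentheses without touching the backreferences.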

Right leg, answered Oct 06 '22 19:10