Matching multiple regex patterns with the alternation operator?

I ran into a small problem using regular expressions in Python.

Suppose this is the input:

(zyx)bc

What I'm trying to achieve is to obtain whatever is between the parentheses as a single match, and each character outside them as an individual match. The desired result would be along the lines of:

['zyx','b','c']

The order of matches should be kept.

I've tried obtaining this with Python 3.3, but can't seem to figure out the correct Regex. So far I have:

import re
matches = re.findall(r'\((.*?)\)|\w', '(zyx)bc')

print(matches) yields the following:

['zyx','','']

Any ideas what I'm doing wrong?

Julian Laval asked Jan 06 '13 12:01


2 Answers

From the documentation of re.findall:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

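Your regexp does match the string three times, but because the pattern contains a group, findall returns the contents of that group rather than the whole match. For the last two matches the \w alternative is the one that matched, so the (.*?) group did not take part and is reported as an empty string. If you want the output of the other half of the regexp, you can add a second group: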

>>> re.findall(r'\((.*?)\)|(\w)', '(zyx)bc')
[('zyx', ''), ('', 'b'), ('', 'c')]
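
One way to turn those tuples back into the flat list the question asks for is to keep whichever element of each tuple is non-empty. This is only a minimal sketch: the names `pairs` and `flattened` are illustrative, and it assumes an empty pair of parentheses `()` is acceptable as an empty string in the result.

```python
import re

pairs = re.findall(r'\((.*?)\)|(\w)', '(zyx)bc')
# For this input exactly one element of each tuple is non-empty; `or` picks it.
flattened = [inside or single for inside, single in pairs]
print(flattened)  # ['zyx', 'b', 'c']
```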

Alternatively, you could remove all the groups to get a simple list of strings again:

>>> re.findall(r'\(.*?\)|\w', '(zyx)bc')
['(zyx)', 'b', 'c']

You would need to manually remove the parentheses though.
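
A minimal sketch of that clean-up step, relying on the fact that any match beginning with `(` came from the first alternative and therefore also ends with `)`. The names `raw` and `tokens` are illustrative only.

```python
import re

raw = re.findall(r'\(.*?\)|\w', '(zyx)bc')
# Slice off the surrounding parentheses for grouped matches;
# single-character matches are kept unchanged.
tokens = [m[1:-1] if m.startswith('(') else m for m in raw]
print(tokens)  # ['zyx', 'b', 'c']
```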

James Henstridge answered Sep 19 '22 21:09


Other answers have shown you how to get the result you need, but with the extra step of manually removing the parentheses. If you use lookarounds in your regex, you won't need to strip the parentheses manually:

>>> import re
>>> s = '(zyx)bc'
>>> print (re.findall(r'(?<=\()\w+(?=\))|\w', s))
['zyx', 'b', 'c']

Explained:

(?<=\() // lookbehind: preceded by a left parenthesis
\w+     // one or more word characters, up to:
(?=\))  // lookahead: followed by a right parenthesis
|       // OR
\w      // any single word character
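
The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows `#` comments. This is a sketch rather than a drop-in solution, and the name `pattern` is illustrative. Note that `\w+` only matches word characters, so parenthesized content containing spaces or punctuation is not returned as one unit.

```python
import re

pattern = re.compile(r'''
    (?<=\()   # lookbehind: preceded by a left parenthesis
    \w+       # one or more word characters
    (?=\))    # lookahead: followed by a right parenthesis
    |         # OR
    \w        # a single word character
''', re.VERBOSE)

print(pattern.findall('(zyx)bc'))  # ['zyx', 'b', 'c']
# Content with a space inside the parentheses falls back to single characters.
print(pattern.findall('(a b)c'))   # ['a', 'b', 'c']
```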

alan answered Sep 22 '22 21:09