
Capturing groups with an or operator in Python

Tags: python, regex

I have found odd behavior in Python 3.7.0 when capturing groups with the alternation (or) operator. When one branch initially matches but the regex eventually has to backtrack and use a different branch, the capture groups stick with the first branch even though the regex uses the second branch.

Example code:

import re

regexString = "^(a)|(ab)$"
captureString = "ab"

match = re.match(regexString, captureString)
print(match.groups())

Output:

('a', None)

The second group is the group that is used, but the first group is captured and the second group isn't.

Interestingly, I have found a workaround by adding non-capturing parentheses around both groups like so:

regexString = "^(?:(a)|(ab))$"

New Output:

(None, 'ab')

To me this behavior looks like a bug. If it is not, can someone point me to some documentation explaining why this is occurring? Thank you!

Michael asked Jun 28 '26 15:06



1 Answer

This is a common regex mistake. Here is your original pattern:

^(a)|(ab)$

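Because alternation has the lowest precedence of any regex operator, this actually says to match either ^(a), i.e. a at the start of the input, or (ab)$, i.e. ab at the end of the input. With the input ab, the first alternative already succeeds on the leading a, so no backtracking ever happens and only group 1 is set. If you instead want to match a or ab as the entire input, then, as you figured out, you need: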

^(?:(a)|(ab))$
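To see how the two patterns are parsed, here is a minimal sketch that runs both against a few inputs. The helper groups_for and the sample strings are made up for illustration, and re.search is used instead of re.match so that the (ab)$ branch can match at the end of a longer string:

```python
import re

# Parsed as (^(a)) | ((ab)$): each anchor belongs to only one branch.
original = re.compile(r"^(a)|(ab)$")
# The non-capturing group makes both anchors apply to both branches.
grouped = re.compile(r"^(?:(a)|(ab))$")

def groups_for(pattern, text):
    """Return the captured groups of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.groups() if m else None

for text in ["ab", "a", "axyz", "xyzab"]:
    print(f"{text!r:8} original={groups_for(original, text)} "
          f"grouped={groups_for(grouped, text)}")
```

The original pattern accepts axyz (through ^a) and xyzab (through ab$), while the grouped pattern rejects both and only accepts a or ab as the whole input.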

To further convince yourself of this behavior, you may verify that the following pattern matches the same things as your original pattern:

(ab)$|^(a)
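One way to run that check is a small brute-force comparison. This is only a sketch: the alphabet and the length limit are arbitrary choices, not part of the original question.

```python
import re
from itertools import product

original = re.compile(r"^(a)|(ab)$")
swapped = re.compile(r"(ab)$|^(a)")

# Every string over the (arbitrarily chosen) alphabet {a, b, x} up to length 4.
samples = ["".join(chars) for n in range(5) for chars in product("abx", repeat=n)]

# Collect inputs that one pattern accepts and the other rejects.
disagreements = [
    s for s in samples
    if bool(original.search(s)) != bool(swapped.search(s))
]
print(len(samples), "strings checked;", len(disagreements), "disagreements")

# The accepted inputs agree, but the matched text can differ:
print(original.search("ab").group(0))  # a
print(swapped.search("ab").group(0))   # ab
```

Both patterns accept exactly the same strings, but for the input ab the original stops after the leading a, whereas the swapped version tries (ab)$ first and matches the whole string.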

That is, each term in the alternation is separate, and the order of the terms does not matter, at least with regard to which inputs would match or not match. By the way, you could have just used the following pattern:

^ab?$

This would match a or ab, and you would not even need a capture group, as the entire match would correspond to what you want.
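A short sketch of that simpler pattern in use. The helper whole_match is a hypothetical name for illustration:

```python
import re

simple = re.compile(r"^ab?$")

def whole_match(text):
    """Return the matched text if the input is exactly 'a' or 'ab', else None."""
    m = simple.match(text)
    return m.group(0) if m else None

for text in ["a", "ab", "abb", "b", ""]:
    print(repr(text), "->", repr(whole_match(text)))
```

Note that $ in Python also matches just before a trailing newline, so the string "ab\n" would be accepted as well. If that matters, re.fullmatch(r"ab?", text) is a stricter alternative that needs no anchors.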

Tim Biegeleisen answered Jul 01 '26 05:07
