I have found odd behavior in Python 3.7.0 when capturing groups with an or operator when one branch initially matches but the regex has to eventually backtrack and use a different branch. In this scenario, the capture groups stick with the first branch even though the regex uses the second branch.
Example code:
regexString = "^(a)|(ab)$"
captureString = "ab"
match = re.match(regexString, captureString)
print(match.groups())
Output:
('a', None)
The second group is the group that is used, but the first group is captured and the second group isn't.
Interestingly, I have found a workaround by adding non-capturing parentheses around both groups like so:
regexString = "^(?:(a)|(ab))$"
New Output:
(None, 'ab')
To me this behavior looks like a bug. If it is not, can someone point me to some documentation explaining why this is occurring? Thank you!
This is a common regex mistake. Here is your original pattern:
^(a)|(ab)$
This actually says to match ^a, i.e. a at the start of the input or ab$, i.e. ab at the end of the input. If you instead want to match a or ab as the entire input, then as you figured out you need:
^(?:(a)|(ab))$
To further convince yourself of this behavior, you may verify that the following pattern matches the same things as your original pattern:
(ab)$|^(a)
That is, each term in alternation is separate, and the position does not even matter, at least with regard to which inputs would match or nor match. By the way, you could have just used the following pattern:
^ab?$
This would match a or ab, and also you would not even need a capture group, as the entire match would correspond to what you want.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With