
Why isn't the regular expression's "non-capturing" group working?

Tags: python, regex

In the snippet below, I expect the non-capturing group "(?:aaa)" to be ignored in the matching result, so the result should be "_bbb" only.

However, I get "aaa_bbb" in the matching result; only when I specify group(1) does it show "_bbb".

>>> import re
>>> s = "aaa_bbb"
>>> print(re.match(r"(?:aaa)(_bbb)", s).group())
aaa_bbb
Jim Horng asked Apr 24 '10 02:04




2 Answers

I think you're misunderstanding the concept of a "non-capturing group". The text matched by a non-capturing group still becomes part of the overall regex match.

Both the regex (?:aaa)(_bbb) and the regex (aaa)(_bbb) return aaa_bbb as the overall match. The difference is that the first regex has one capturing group which returns _bbb as its match, while the second regex has two capturing groups that return aaa and _bbb as their respective matches. In your Python code, to get _bbb, you'd need to use group(1) with the first regex, and group(2) with the second regex.
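A minimal sketch of that comparison, using the string from the question. The variable names `non_capturing` and `capturing` are only illustrative:

```python
import re

s = "aaa_bbb"

# One capturing group: (?:aaa) groups the text but is not numbered.
non_capturing = re.match(r"(?:aaa)(_bbb)", s)
# Two capturing groups: aaa is group 1, _bbb is group 2.
capturing = re.match(r"(aaa)(_bbb)", s)

# The overall match is identical in both cases.
print(non_capturing.group())   # aaa_bbb
print(capturing.group())       # aaa_bbb

# Only the numbered groups differ.
print(non_capturing.groups())  # ('_bbb',)
print(capturing.groups())      # ('aaa', '_bbb')
print(non_capturing.group(1))  # _bbb
print(capturing.group(2))      # _bbb
```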

The main benefit of non-capturing groups is that you can add them to a regex without upsetting the numbering of the capturing groups in the regex. They also offer (slightly) better performance as the regex engine doesn't have to keep track of the text matched by non-capturing groups.
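To illustrate the numbering point, here is a small sketch with a made-up input string and made-up patterns (none of them come from the question). Adding a non-capturing prefix leaves the existing group numbers alone, while adding a capturing one shifts them:

```python
import re

text = "ref:12-34"  # hypothetical input, for illustration only

# Original pattern: two capturing groups.
base = re.search(r"(\d+)-(\d+)", text)

# Prefix added as a non-capturing group: numbering is unchanged.
with_non_capturing = re.search(r"(?:id|ref):(\d+)-(\d+)", text)

# Prefix added as a capturing group: every later group shifts by one.
with_capturing = re.search(r"(id|ref):(\d+)-(\d+)", text)

print(base.groups())                # ('12', '34')
print(with_non_capturing.groups())  # ('12', '34')
print(with_capturing.groups())      # ('ref', '12', '34')
```

Code that reads group(1) and group(2) keeps working with the second pattern but would need updating for the third.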

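If you really want to exclude aaa from the overall regex match then you need to use lookaround. In this case, positive lookbehind does the trick: (?<=aaa)_bbb. With this regex, group() returns _bbb in Python, provided you call re.search() rather than re.match(): re.match() only tries to match at the start of the string, where there is no preceding aaa, so the lookbehind fails. No capturing groups needed.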
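A short sketch of the lookbehind approach on the question's string. Note that it relies on re.search(), since re.match() anchors the attempt at position 0, where nothing precedes the match:

```python
import re

s = "aaa_bbb"

# re.search scans the string, so the match can start right after "aaa".
found = re.search(r"(?<=aaa)_bbb", s)
print(found.group())  # _bbb

# re.match only tries position 0, where the lookbehind cannot succeed.
anchored = re.match(r"(?<=aaa)_bbb", s)
print(anchored)       # None
```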

My recommendation is that if you have the ability to use capturing groups to get part of the regex match, use that method instead of lookaround.

Jan Goyvaerts answered Sep 17 '22 20:09


group() and group(0) will return the entire match. Subsequent groups are actual capture groups.

>>> import re
>>> string1 = "aaa_bbb"
>>> print(re.match(r"(?:aaa)(_bbb)", string1).group(0))
aaa_bbb
>>> print(re.match(r"(?:aaa)(_bbb)", string1).group(1))
_bbb
>>> print(re.match(r"(?:aaa)(_bbb)", string1).group(2))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: no such group

If you want a single string built only from the captured groups, which is the behavior you expected from group():

" ".join(re.match(r"(?:aaa)(_bbb)", string1).groups())
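One caveat, sketched below with a hypothetical second group (_ccc) that is not in the question: joining with " " puts a space between the captured groups, so with more than one group an empty separator may be closer to what is wanted.

```python
import re

# Hypothetical two-group variant of the question's pattern and string.
m = re.match(r"(?:aaa)(_bbb)(_ccc)", "aaa_bbb_ccc")

spaced = " ".join(m.groups())
joined = "".join(m.groups())

print(m.groups())  # ('_bbb', '_ccc')
print(spaced)      # _bbb _ccc
print(joined)      # _bbb_ccc
```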

Richard Simões answered Sep 20 '22 20:09