Java regex capturing groups indexes

Tags: java, regex

I have the following line:

String typeName = "ABC:xxxxx;";

I need to fetch the word ABC, so I wrote the following code snippet:

Pattern pattern4 = Pattern.compile("(.*):");
Matcher matcher = pattern4.matcher(typeName);
String nameStr = "";
if (matcher.find()) {
    nameStr = matcher.group(1);
}

With group(0) I get ABC:, but with group(1) I get ABC, so I want to know:

1. What do 0 and 1 mean? It would help if someone could explain this with good examples.
2. The regex pattern contains a :, so why does the group(1) result omit it? Does group 1 capture everything matched inside the parentheses?
3. If I add two more pairs of parentheses, such as \\s*(\d*)(.*), will there be two groups? Will group(1) return the (\d*) part and group(2) return the (.*) part?

The code snippet is only meant to illustrate my confusion. It is not the code I am actually working with, and the same result could be obtained much more easily with String.split().

852

asked May 13 '13 08:05

P basak

2 Answers

Capturing and grouping

A capturing group, (pattern), creates a group that has the capturing property.

A related construct that you might often see (and use) is (?:pattern), which creates a group without the capturing property, hence the name non-capturing group.

A group is usually used when you need to repeat a sequence of patterns, e.g. (\.\w+)+, or to specify where alternation should take effect, e.g. ^(0*1|1*0)$ (^, then 0*1 or 1*0, then $) versus ^0*1|1*0$ (^0*1 or 1*0$).

A capturing group, apart from grouping, will also record the text matched by the pattern inside the capturing group (pattern). Using your example, (.*):, .* matches ABC and : matches :, and since .* is inside capturing group (.*), the text ABC is recorded for the capturing group 1.
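To make this concrete, here is a minimal sketch that runs the pattern from the question and prints both group(0) and group(1). The class name GroupZeroVsOne and the helper firstMatch are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupZeroVsOne {
    // Hypothetical helper: returns {group(0), group(1)} for the first match,
    // or null if the regex does not match anywhere in the input.
    static String[] firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatch("(.*):", "ABC:xxxxx;");
        // group(0) is everything the whole pattern matched, including the colon.
        System.out.println("group(0) = " + groups[0]);
        // group(1) is only what the part inside the parentheses matched.
        System.out.println("group(1) = " + groups[1]);
    }
}
```

This should print ABC: for group(0) and ABC for group(1): the colon is part of the overall match, but it sits outside the parentheses, so it is not recorded for group 1.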

Group number

The whole pattern is defined to be group number 0.

Capturing groups in the pattern are numbered starting from 1. The numbers are assigned in the order of the opening parentheses of the capturing groups, from left to right. As an example, here are all 5 capturing groups in the pattern below:

(group)(?:non-capturing-group)(g(?:ro|u)p( (nested)inside)(another)group)(?=assertion)
|     |                       |          | |      |      ||       |     |
1-----1                       |          | 4------4      |5-------5     |
                              |          3---------------3              |
                              2-----------------------------------------2
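If you want to check the numbering yourself, the sketch below compiles that same pattern and prints every group. The class name GroupNumbering and the INPUT string are assumptions of mine: the input is simply a string constructed so that the whole pattern matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    static final Pattern PATTERN = Pattern.compile(
        "(group)(?:non-capturing-group)(g(?:ro|u)p( (nested)inside)(another)group)(?=assertion)");

    // Hypothetical input, built by hand so that the pattern above matches it.
    static final String INPUT =
        "groupnon-capturing-groupgrop nestedinsideanothergroupassertion";

    public static void main(String[] args) {
        Matcher m = PATTERN.matcher(INPUT);
        if (m.find()) {
            // groupCount() counts capturing groups only; group 0 is not included.
            System.out.println("groupCount() = " + m.groupCount());
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println("group(" + i + ") = \"" + m.group(i) + "\"");
            }
        }
    }
}
```

groupCount() should report 5. Group 2 holds "grop nestedinsideanothergroup", group 3 holds " nestedinside", group 4 holds "nested" and group 5 holds "another". Neither the non-capturing group nor the lookahead gets a number, and the text "assertion" is not part of group(0) because a lookahead does not consume input.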

The group numbers are used in the back-reference \n in a pattern and in $n in a replacement string.

(In other regex flavors, such as PCRE and Perl, they can also be used in subroutine calls.)
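As a rough illustration of both uses in Java, the sketch below uses \1 inside a pattern to find an immediately repeated word, and $1 and $2 in replacement strings. The class and method names are hypothetical; the sample strings are made up.

```java
import java.util.regex.Pattern;

class BackReferenceDemo {
    // \1 in the pattern must match the same text that group 1 just matched.
    static final Pattern DOUBLED_WORD = Pattern.compile("\\b(\\w+) \\1\\b");

    // Replaces "word word" with "word"; $1 in the replacement is group 1's text.
    static String collapseDoubledWords(String text) {
        return DOUBLED_WORD.matcher(text).replaceAll("$1");
    }

    // Swaps two space-separated words by referring to groups 2 and 1.
    static String swapWords(String text) {
        return text.replaceAll("(\\w+) (\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        System.out.println(collapseDoubledWords("this is is a test")); // this is a test
        System.out.println(swapWords("John Smith"));                   // Smith John
    }
}
```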

You can access the text matched by a given group with Matcher.group(int group). The group number can be worked out with the rule stated above.

(In some regex flavors, such as PCRE and Perl, there is a branch reset feature which allows you to use the same number for capturing groups in different branches of an alternation.)
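Applied to the third question, a pattern like \s*(\d*)(.*) does have exactly two capturing groups. Here is a small sketch, assuming a made-up input of "  42abc" and a hypothetical class name TwoGroups:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoGroups {
    // In a Java string literal, each backslash of the regex has to be doubled.
    static final Pattern PATTERN = Pattern.compile("\\s*(\\d*)(.*)");

    public static void main(String[] args) {
        Matcher m = PATTERN.matcher("  42abc");
        if (m.matches()) {
            System.out.println("groupCount() = " + m.groupCount()); // 2
            System.out.println("group(1) = " + m.group(1));         // 42  (the (\d*) part)
            System.out.println("group(2) = " + m.group(2));         // abc (the (.*) part)
        }
    }
}
```

Note that the leading \s* is outside any parentheses, so the two leading spaces show up in group(0) only. If the input has no digits at that position, group(1) is the empty string rather than null, because \d* is allowed to match nothing.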

Group name

From Java 7, you can define a named capturing group (?<name>pattern), and you can access the content matched with Matcher.group(String name). The regex is longer, but the code is more meaningful, since it indicates what you are trying to match or extract with the regex.

The group names are used in the back-reference \k<name> in a pattern and in ${name} in a replacement string.
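Here is a minimal sketch of a named group applied to the example from the question, together with \k<name> and ${name}. The class name, method names and the group names name and word are all my own choices for illustration; the pattern uses [^:]* instead of the question's .* just to keep the match from running past the first colon.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // (?<name>...) is a capturing group that can be looked up by the name "name".
    static final Pattern TYPE = Pattern.compile("(?<name>[^:]*):");

    // Returns the text before the first colon, or null if there is no colon.
    static String extractName(String typeName) {
        Matcher m = TYPE.matcher(typeName);
        return m.find() ? m.group("name") : null;
    }

    // \k<word> is a back-reference by name; ${word} refers to the group in the replacement.
    static String collapseDoubledWords(String text) {
        return text.replaceAll("\\b(?<word>\\w+) \\k<word>\\b", "${word}");
    }

    public static void main(String[] args) {
        System.out.println(extractName("ABC:xxxxx;"));                 // ABC
        System.out.println(collapseDoubledWords("this is is a test")); // this is a test
    }
}
```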

Named capturing groups are still numbered with the same numbering scheme, so they can also be accessed via Matcher.group(int group).

Internally, Java's implementation just maps from the name to the group number. Therefore, you cannot use the same name for 2 different capturing groups.
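Both points can be checked with a short sketch. The class name, the compiles helper and the year/month pattern are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class NamedGroupNumbering {
    // Hypothetical helper: reports whether Pattern.compile accepts the regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher("2013-05");
        if (m.matches()) {
            // Named groups are numbered like any other capturing group.
            System.out.println(m.group("year") + " == " + m.group(1));  // 2013 == 2013
            System.out.println(m.group("month") + " == " + m.group(2)); // 05 == 05
        }
        // Reusing a name is rejected when the pattern is compiled.
        System.out.println(compiles("(?<x>a)|(?<x>b)")); // false
    }
}
```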

133

answered Nov 13 '22 06:11

nhahtdh

For The Rest Of Us

Here is a simple and clear example of how this works:

       (     G1     )( G2  )(    G3    )( G4  )(  G5  )
Regex: ([a-zA-Z0-9]+)([\s]+)([a-zA-Z ]+)([\s]+)([0-9]+)

String: "!* UserName10 John Smith 01123 *!"

group(0): UserName10 John Smith 01123
group(1): UserName10
group(2): " " (a single space)
group(3): John Smith
group(4): " " (a single space)
group(5): 01123

As you can see, I have created FIVE groups which are each enclosed in parentheses.

I included the !* and *! on either side to make it clearer. None of those characters are matched by the regex, so they do not appear in the results. group(0) simply gives you the entire matched string (everything matched by the whole pattern). Group 1 stops right before the first space because the space character is not in its character class. Groups 2 and 4 are just the whitespace, which in this case is a single space character but could also be a tab, a line feed, etc. Group 3 includes the space between John and Smith because I put a space in its character class, and so on.
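If you would like to reproduce the results above, a sketch along these lines should do it. The class name FiveGroups is made up; the regex and the input string are the ones from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FiveGroups {
    static final Pattern PATTERN =
        Pattern.compile("([a-zA-Z0-9]+)([\\s]+)([a-zA-Z ]+)([\\s]+)([0-9]+)");

    static final String INPUT = "!* UserName10 John Smith 01123 *!";

    public static void main(String[] args) {
        Matcher m = PATTERN.matcher(INPUT);
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                // Square brackets make the whitespace-only groups visible.
                System.out.println("group(" + i + "): [" + m.group(i) + "]");
            }
        }
    }
}
```

One detail worth noticing: group 3 first grabs the trailing space after Smith as well, then gives it back so that group 4 has at least one whitespace character to match. That backtracking is why group(3) ends up as John Smith without a trailing space.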

Hope this makes sense.