Documentation for ?: in regex?

Tags:

regex

php

A while ago, I saw in regex (at least in PHP) you can make a capturing group not capture by prepending ?:.

Example

    $str = 'big blue ball';
    $regex = '/b(ig|all)/';
    preg_match_all($regex, $str, $matches);
    var_dump($matches);

Outputs...

    array(2) {
      [0]=>
      array(2) {
        [0]=>
        string(3) "big"
        [1]=>
        string(4) "ball"
      }
      [1]=>
      array(2) {
        [0]=>
        string(2) "ig"
        [1]=>
        string(3) "all"
      }
    }

In this example, I don't care about what was matched inside the parentheses, so I prepended ?: to the group ('/b(?:ig|all)/') and got this output:

    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(3) "big"
        [1]=>
        string(4) "ball"
      }
    }

This is very useful - at least I think so. Sometimes you just don't want to clutter your matches with unnecessary values.

I was trying to look up the documentation and the official name for this (I call it a non-capturing group, and I think I've heard that term before).

Because it consists only of symbols, it is hard to Google for.

I have also looked at a number of regex reference guides, with no mention.

Since it is prefixed with ? and appears as the first characters inside the parentheses, I am led to believe it has something to do with lookaheads or lookbehinds.

So, what is the proper name for these, and where can I learn more?

alex asked Sep 28 '10 12:09



2 Answers

It's available on the Subpatterns page of the official documentation.

The fact that plain parentheses fulfill two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string "the white queen" is matched against the pattern the ((?:red|white) (king|queen)) the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured substrings is 99, and the maximum number of all subpatterns, both capturing and non-capturing, is 200.
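As a rough sketch, the "white queen" example from the quoted passage can be tried directly in PHP. The pattern and subject come from the documentation quote; the variable names and the use of preg_match are my own choices for illustration.

```php
<?php
// Subject and pattern taken from the quoted documentation passage.
$subject = 'the white queen';
$pattern = '/the ((?:red|white) (king|queen))/';

preg_match($pattern, $subject, $matches);

// $matches[0] holds the full match. The non-capturing (?:red|white)
// gets no slot of its own, so "queen" ends up as group 2, not group 3.
print_r($matches);
// [0] => the white queen
// [1] => white queen
// [2] => queen
```

Replacing (?:red|white) with a plain (red|white) would add "white" as group 2 and push "queen" to group 3.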

It's also good to note that you can set options for the subpattern with it. For example, if you want only the sub-pattern to be case insensitive, you can do:

(?i:foo)bar 

Will match:

  • foobar
  • Foobar
  • FoObar
  • ...etc

But not

  • fooBar
  • FooBAR
  • ...etc
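A quick way to check the lists above is to run each candidate through preg_match. This is only a sketch: the pattern is the one from this answer, while the $candidates and $results names are made up for the example.

```php
<?php
// Only the "foo" part is case-insensitive; "bar" must be lowercase.
$pattern = '/(?i:foo)bar/';

$candidates = ['foobar', 'Foobar', 'FoObar', 'fooBar', 'FooBAR'];

$results = [];
foreach ($candidates as $candidate) {
    // preg_match returns 1 on a match and 0 on no match.
    $results[$candidate] = preg_match($pattern, $candidate) === 1;
}

var_export($results);
// foobar, Foobar and FoObar map to true; fooBar and FooBAR map to false.
```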

Oh, and while the official documentation doesn't actually explicitly name the syntax, it does refer to it later on as a "non-capturing subpattern" (which makes complete sense, and is what I would call it anyway, since it's not really a "group", but a subpattern)...

ircmaxell answered Oct 02 '22 10:10


(?:...) as a whole represents a non-capturing group.

Regular-expressions.info describes this syntax:

The question mark and the colon after the opening round bracket are the special syntax that you can use to tell the regex engine that this pair of brackets should not create a backreference. Note the question mark [...] is the regex operator that makes the previous token optional. This operator cannot appear after an opening round bracket, because an opening bracket by itself is not a valid regex token. Therefore, there is no confusion between the question mark as an operator to make a token optional, and the question mark as a character to change the properties of a pair of round brackets. The colon indicates that the change we want to make is to turn off capturing the backreference.
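To illustrate the point about backreferences, here is a small sketch in PHP. The pattern and test strings are invented for this example and do not come from the quoted article. Because (?:big|small) is not numbered, \1 refers to the first capturing group, (ball|box).

```php
<?php
// (?:big|small) groups the alternatives without creating a backreference,
// so \1 points at (ball|box), the first *capturing* group.
$pattern = '/(?:big|small) (ball|box) \1/';

// \1 must repeat "ball", so this subject matches.
$first = preg_match($pattern, 'big ball ball', $matches);

// "big" was matched by the non-capturing group and is not stored anywhere,
// so repeating it does not satisfy \1.
$second = preg_match($pattern, 'big ball big');

var_dump($first, $second, $matches);
// $first is 1, $second is 0,
// $matches is ['big ball ball', 'ball']
```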

Julien Hoarau answered Oct 02 '22 09:10