
PHP regex non-capture non-match group

I'm making a date matching regex, and it's all going pretty well, I've got this so far:

"/(?:[0-3])?[0-9]-(?:[0-1])?[0-9]-(?:20)[0-1][0-9]/"

It will (hopefully) match single or double digit days and months, and double or quadruple digit years in the 21st century. A few trials and errors have gotten me this far.

But, I've got two simple questions regarding these results:

  1. (?: ) what is a simple explanation for this? Apparently it's a non-matching group. But then...

  2. What is the trailing ? for? e.g. (?: )?

Ben asked May 10 '11 03:05


2 Answers

[Edited (again) to improve formatting and fix the intro.]

This is a comment and an answer.

The answer part... I do agree with alex's earlier answer.

  1. (?: ), in contrast to ( ), is used to avoid capturing text, generally so as to have fewer back references mixed in with the ones you do want, or to improve performance.

  2. The ? following the (?: ) -- or following anything except * + ? or {} -- means that the preceding item is optional: it may or may not be present in a legitimate match. E.g., /z34?/ will match z3 as well as z34, but it won't match z on its own, and in z35 it matches only the z3 part.
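To make both points concrete, here is a minimal PHP sketch using preg_match. The sample strings and variable names are made up for illustration; they are not from the question.

```php
<?php
// Capturing group: the optional leading digit is stored in $capturing[1].
preg_match('/([0-3])?[0-9]-/', '25-', $capturing);

// Non-capturing group: same overall match, but no sub-match is stored.
preg_match('/(?:[0-3])?[0-9]-/', '25-', $nonCapturing);

// Trailing ?: the 4 is optional. The anchors (^ and $) force a whole-string match.
$wholeString = [];
foreach (['z3', 'z34', 'z35', 'z'] as $sample) {
    $wholeString[$sample] = preg_match('/^z34?$/', $sample);
}

// Without anchors, the pattern still finds the "z3" inside "z35".
preg_match('/z34?/', 'z35', $partial);

var_dump($capturing, $nonCapturing, $wholeString, $partial);
```

Here $capturing holds the full match plus one sub-match ('25-' and '2'), while $nonCapturing holds only the full match. The anchored test accepts z3 and z34 and rejects z35 and z.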

The comment part... I made what might be considered improvements to the regex you were working on:

(?:^|\s)(0?[1-9]|[1-2][0-9]|30|31)-(0?[1-9]|10|11|12)-((?:20)?[0-9][0-9])(?:\s|$)

-- First, it avoids things like 0-0-2011

-- Second, it avoids things like 233443-4-201154564

-- Third, it includes things like 1-1-2022

-- Fourth, it includes things like 1-1-11

-- Fifth, it avoids things like 34-4-11

-- Sixth, it allows you to capture the day, month, and year so you can refer to them more easily in code. That code could, for example, do a further check to see whether a Feb 29 date qualifies: if the second captured group is 2, then either the first captured group is 29 and the year is a leap year, or the first captured group is less than 29.
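As a rough sketch of that follow-up check in PHP: the helper name isRealDate is hypothetical, and instead of hand-written leap-year logic it leans on PHP's built-in checkdate(). It also assumes that a two-digit year means 20xx.

```php
<?php
// Hypothetical helper: true when $text is a d-m-y date that really exists.
function isRealDate(string $text): bool
{
    $pattern = '/(?:^|\s)(0?[1-9]|[1-2][0-9]|30|31)-(0?[1-9]|10|11|12)'
             . '-((?:20)?[0-9][0-9])(?:\s|$)/';

    if (!preg_match($pattern, $text, $m)) {
        return false; // not even shaped like a date
    }

    $day   = (int) $m[1]; // first captured group
    $month = (int) $m[2]; // second captured group
    $year  = (int) $m[3]; // third captured group

    if ($year < 100) {
        $year += 2000; // assumption: 11 means 2011
    }

    // checkdate() knows the month lengths and the leap-year rule.
    return checkdate($month, $day, $year);
}

var_dump(isRealDate('29-2-2012')); // 2012 is a leap year
var_dump(isRealDate('29-2-2011')); // 2011 is not
```

Letting the regex handle the shape and a date function handle the calendar keeps the pattern short, at the cost of a second step.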

Finally, note that you'll still match dates that don't exist, e.g., 31-6-11. If you want to avoid these, then try:

(?:^|\s)(?:(?:(0?[1-9]|[1-2][0-9]|30|31)-(0?[13578]|10|12))|(?:(0?[1-9]|[1-2][0-9]|30)-(0?[469]|11))|(?:(0?[1-9]|[1-2][0-9])-(0?2)))-((?:20)?[0-9][0-9])(?:\s|$)
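One thing to watch with this longer pattern: each of the three alternatives has its own pair of capturing groups, so the day and month land in groups 1 and 2, 3 and 4, or 5 and 6 depending on which alternative matched, and the year is always group 7. A minimal PHP sketch of reading them back (DATE_PATTERN and extractDate are hypothetical names):

```php
<?php
const DATE_PATTERN = '/(?:^|\s)(?:(?:(0?[1-9]|[1-2][0-9]|30|31)-(0?[13578]|10|12))'
    . '|(?:(0?[1-9]|[1-2][0-9]|30)-(0?[469]|11))'
    . '|(?:(0?[1-9]|[1-2][0-9])-(0?2)))-((?:20)?[0-9][0-9])(?:\s|$)/';

// Hypothetical helper: returns the day, month, and year as strings, or null.
function extractDate(string $text): ?array
{
    if (!preg_match(DATE_PATTERN, $text, $m)) {
        return null;
    }

    // Groups from the alternatives that did not match are empty strings,
    // so concatenating each trio yields the one value that was captured.
    return [
        'day'   => $m[1] . $m[3] . $m[5],
        'month' => $m[2] . $m[4] . $m[6],
        'year'  => $m[7],
    ];
}

var_dump(extractDate('30-6-11')); // matched by the second alternative
var_dump(extractDate('31-6-11')); // June has no 31st, so no match
```

Note that the February alternative still accepts 29-2 in any year, so a leap-year check in code is still needed.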

Also, I assumed the dates would be preceded and followed by a space (or the beginning/end of a line), but you may want to adjust that (e.g., to allow punctuation).
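One possible adjustment, offered as a sketch rather than the only way to do it: swap the whitespace requirement for lookarounds that only insist the date is not glued to another digit or hyphen. The variable names and sample strings below are made up for illustration.

```php
<?php
// (?<![0-9-]) : not preceded by a digit or hyphen
// (?![0-9-])  : not followed by a digit or hyphen
$loosePattern = '/(?<![0-9-])(0?[1-9]|[1-2][0-9]|30|31)-(0?[1-9]|10|11|12)'
              . '-((?:20)?[0-9][0-9])(?![0-9-])/';

// A date wrapped in punctuation is now found.
$inParens = preg_match($loosePattern, 'Due (1-1-11).', $m);

// A date buried in a longer run of digits is still rejected.
$inDigits = preg_match($loosePattern, '233443-4-201154564');

var_dump($inParens, $m, $inDigits);
```

Because lookarounds do not consume characters, the full match in $m[0] is just the date itself, without any surrounding space.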

A commenter elsewhere referenced this resource which you might find useful: http://rubular.com/

Jose_X answered Oct 23 '22 03:10


  1. It is a non-capturing group. You cannot back-reference it. It is usually used to declutter back references and/or improve performance.
  2. It means the preceding group (here, the non-capturing one) is optional.
alex answered Oct 23 '22 04:10