Question marks in regular expressions

I'm reading the regular expressions reference and I'm wondering about the ? and ?? quantifiers. Could you explain their usefulness with some examples? I don't understand them well enough.

Thank you.

xralf asked Apr 07 '11 15:04


2 Answers

This is an excellent question, and it took me a while to see the point of the lazy ?? quantifier myself.

? - Optional (greedy) quantifier

The usefulness of ? is easy enough to understand. If you wanted to find both http and https, you could use a pattern like this:

https? 

This pattern will match both inputs, because it makes the s optional.
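As a quick check, here is a minimal sketch in Python (my choice of language; the answer itself doesn't name one). It uses re.fullmatch so the whole input has to match, and the test strings are just illustrative:

```python
import re

# The trailing "s" is optional, so both schemes match in full.
pattern = re.compile(r"https?")

for text in ["http", "https", "htt"]:
    # fullmatch returns a match object or None
    print(text, bool(pattern.fullmatch(text)))
```

The first two inputs match and "htt" does not, because only the s is optional.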

?? - Optional (lazy) quantifier

?? is more subtle. It usually does the same thing ? does. It doesn't change the true/false result when you ask: "Does this input satisfy this regex?" Instead, it's relevant to the question: "Which part of this input matches this regex, and which parts belong in which groups?" If an input could satisfy the pattern in more than one way, the engine will decide how to group it based on ? vs. ?? (or * vs. *?, or + vs. +?).
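To make the "same pass/fail, different grouping" point concrete, here is a small Python sketch using + vs. +? (the input string and the two-group pattern are made up for illustration):

```python
import re

text = "12345"

# Same pattern twice; only the first quantifier differs (greedy vs. lazy).
greedy = re.fullmatch(r"(\d+)(\d*)", text)
lazy = re.fullmatch(r"(\d+?)(\d*)", text)

# Both patterns accept the input...
print(greedy is not None, lazy is not None)  # True True

# ...but they divide it between the groups differently.
print(greedy.groups())  # ('12345', '')
print(lazy.groups())    # ('1', '2345')
```

The greedy group takes every digit it can, while the lazy group takes the minimum (one digit) and leaves the rest for the second group.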

Say you have a set of inputs that you want to validate and parse. Here's an (admittedly silly) example:

Input:
http123
https456
httpsomething

Expected result:
Pass/Fail  Group 1   Group 2
Pass       http      123
Pass       https     456
Pass       http      something

You try the first thing that comes to mind, which is this:

^(http)([a-z\d]+)$ 

Pass/Fail  Group 1   Group 2    Grouped correctly?
Pass       http      123        Yes
Pass       http      s456       No
Pass       http      something  Yes

They all pass, but you can't use the second set of results because you only wanted 456 in Group 2.

Fine, let's try again. Let's say Group 2 can be letters or numbers, but not both:

^(https?)([a-z]+|\d+)$

Pass/Fail  Group 1   Group 2   Grouped correctly?
Pass       http      123       Yes
Pass       https     456       Yes
Pass       https     omething  No

Now the second input is fine, but the third one is grouped wrong because ? is greedy by default (the + is too, but the ? came first). When deciding whether the s is part of https? or [a-z]+|\d+, if the result is a pass either way, the regex engine will always pick the one on the left. So Group 2 loses s because Group 1 sucked it up.

To fix this, you make one tiny change:

^(https??)([a-z]+|\d+)$

Pass/Fail  Group 1   Group 2    Grouped correctly?
Pass       http      123        Yes
Pass       https     456        Yes
Pass       http      something  Yes

Essentially, this means: "Match https if you have to, but see if this still passes when Group 1 is just http." The engine realizes that the s could work as part of [a-z]+|\d+, so it prefers to put it into Group 2.
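If you want to reproduce the three tables yourself, something like the following Python sketch should do it. Note the assumptions: I've anchored all three patterns with ^ and $ so the whole input must match, and the parse helper and the dictionary labels are names I made up for this example.

```python
import re

inputs = ["http123", "https456", "httpsomething"]

# The three attempts from above, each anchored to the full input.
patterns = {
    "first attempt": r"^(http)([a-z\d]+)$",
    "greedy ?": r"^(https?)([a-z]+|\d+)$",
    "lazy ??": r"^(https??)([a-z]+|\d+)$",
}

def parse(pattern, text):
    """Return (group 1, group 2) if the input passes, else None."""
    m = re.match(pattern, text)
    return m.groups() if m else None

results = {name: [parse(p, t) for t in inputs] for name, p in patterns.items()}

for name, rows in results.items():
    print(name)
    for text, groups in zip(inputs, rows):
        print("  ", text, "->", groups)
```

Every input passes under every pattern; only the lazy version puts http and something into the groups for the third input while still giving https and 456 for the second.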

Justin Morgan answered Oct 10 '22 07:10


The key difference between ? and ?? concerns their laziness. ?? is lazy, ? is not.

Let's say you want to search for the word "car" in a body of text, but you don't want to be restricted to just the singular "car"; you also want to match against the plural "cars".

Here's an example sentence:

I own three cars.

Now, if I wanted to match the word "car" and I only wanted to get the string "car" in return, I would use the lazy ?? like so:

cars??

This says, "look for the word car or cars; if you find either, return car and nothing more".

Now, if I wanted to match against the same words ("car" or "cars") and I wanted to get the whole match in return, I'd use the non-lazy ? like so:

cars?

This says, "look for the word car or cars, and return either car or cars, whatever you find".

In the world of computer programming, lazy generally means "evaluating only as much as is needed". So the lazy ?? only returns as much as is needed to make a match; since the "s" in "cars" is optional, don't return it. On the flip side, non-lazy (sometimes called greedy) operations evaluate as much as possible, hence the ? returns all of the match, including the optional "s".

Personally, I find myself using ? as a way of making other regular expression operators lazy (like the * and + operators) more often than I use it for simple character optionality, but YMMV.
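For instance, here is a rough Python sketch of that use (Python rather than Clojure, and the HTML snippet is just a made-up sample) showing how a trailing ? changes what + grabs:

```python
import re

html = "<b>bold</b>"

# Greedy: .+ runs as far as it can, then backs up to the last ">".
greedy_match = re.search(r"<.+>", html).group()

# Lazy: .+? stops at the first ">" that lets the pattern succeed.
lazy_match = re.search(r"<.+?>", html).group()

print(greedy_match)  # <b>bold</b>
print(lazy_match)    # <b>
```

The same idea applies to *? in place of *.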

See it in Code

Here's the above implemented in Clojure as an example:

(re-find #"cars??" "I own three cars.")
;=> "car"

(re-find #"cars?" "I own three cars.")
;=> "cars"

re-find is a function that takes a regular expression (here #"cars??") as its first argument and returns the first match it finds in its second argument, "I own three cars."

semperos answered Oct 10 '22 07:10