
How does the ? make a quantifier lazy in regex?

I've been looking into regex lately and figured out that the ? operator makes the *, +, or ? quantifier lazy. My question is: how does it do that? Is *? a special operator, for example, or does the ? have an effect on the * ? In other words, does regex recognize *? as one operator in itself, or as the two separate operators * and ? ? If *? is recognized as two separate operators, how does the ? affect the * to make it lazy? If ? means that the * is optional, shouldn't this mean that the * doesn't have to exist at all? If so, then in a pattern like .*? wouldn't regex just match separate letters and the whole string instead of the shorter string? Please explain, I'm desperate to understand. Many thanks.

Uriel Katz asked Nov 28 '22 08:11



2 Answers

? can mean a lot of different things in different contexts.

  • Following a normal regex token (a character, a shorthand, a character class, a group...), it means "Match the previous item 0 or 1 times".
  • Following a quantifier like ?, *, +, or {n,m}, it takes on a different meaning: "Make the previous quantifier lazy instead of greedy". That assumes greedy is the default, which can be changed: in PHP, for example, the /U modifier makes all quantifiers lazy by default, so the additional ? makes them greedy.
  • Right after an opening parenthesis, it marks the start of a special construct, for example:

    a) (?s): mode modifiers ("turn on dotall mode")
    b) (?:...): make the group non-capturing
    c) (?=...) or (?!...): lookahead assertion
    d) (?<=...) or (?<!...): lookbehind assertion
    e) (?>...): atomic group
    f) (?<foo>...): named capturing group
    g) (?#comment): inline comments, ignored by the regex engine
    h) (?(?=if)then|else): conditionals

and others. Not all constructs are available in all regex flavors.

  • Within a character class ([?]), it simply matches a verbatim ?.
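To make the four contexts concrete, here is a small sketch using Python's `re` module. The sample strings and variable names are made up for illustration, and other flavors may differ in which constructs they support.

```python
import re

# 1. After a normal token: "match the previous item 0 or 1 times".
optional = re.compile(r"colou?r")
print(bool(optional.fullmatch("color")), bool(optional.fullmatch("colour")))  # True True

# 2. After a quantifier: make that quantifier lazy.
text = "<b>bold</b>"
greedy = re.search(r"<.+>", text).group()   # '<b>bold</b>'
lazy = re.search(r"<.+?>", text).group()    # '<b>'
print(greedy, lazy)

# 3. Right after "(": a special construct, here the (?s) dotall mode modifier.
dotall = re.search(r"(?s)a.b", "a\nb")      # matches, because . may match a newline
no_dotall = re.search(r"a.b", "a\nb")       # None, because . stops at the newline
print(dotall is not None, no_dotall is None)  # True True

# 4. Inside a character class: a literal question mark.
literal = re.findall(r"[?]", "why? how?")
print(literal)  # ['?', '?']
```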
Tim Pietzcker answered Dec 05 '22 05:12



I think a little history will make it easier to understand. When Larry Wall wanted to grow regex syntax to support new features, his options were severely limited. He couldn't just decree (for example) that % is now a metacharacter that supports new feature "XYZ". That would break the millions of existing regexes that happened to use % to match a literal percent sign.

What he could do is take an already-defined metacharacter and use it in such a way that its original function wouldn't make sense. For example, any regex that contained two quantifiers in a row would be invalid, so it was safe to say a ? after another quantifier now turns it into a reluctant quantifier (a much better name than "lazy" IMO; "non-greedy" is good too). So the answer to your question is that ? doesn't modify the *; *? is a single entity: a reluctant quantifier. The same is true of the + in possessive quantifiers (*+, {0,2}+ etc.).
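A quick way to convince yourself of this, sketched with Python's `re` module (the test strings are arbitrary, and other flavors may word their errors differently): compare *? with what "an optional *" would really look like, and check that two ordinary quantifiers in a row are rejected.

```python
import re

text = "aaa"

greedy = re.match(r"a*", text).group()       # takes as many as possible: 'aaa'
reluctant = re.match(r"a*?", text).group()   # takes as few as possible: ''

# If ? merely made "a*" optional, this is the pattern it would be equivalent to,
# and it is still greedy.
optional_star = re.match(r"(?:a*)?", text).group()  # 'aaa'

# A reluctant quantifier expands only as far as the rest of the pattern requires.
reluctant_then_b = re.match(r"a*?b", "aaab").group()  # 'aaab'

# Two ordinary quantifiers in a row are not valid syntax in this flavor.
try:
    re.compile(r"a**")
    stacked_is_valid = True
except re.error:
    stacked_is_valid = False

print(repr(greedy), repr(reluctant), repr(optional_star))  # 'aaa' '' 'aaa'
print(reluctant_then_b, stacked_is_valid)                  # aaab False
```

So .*? does not mean "maybe skip the star"; it starts by matching nothing and grows one character at a time until the rest of the pattern can match.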

A similar process occurred with group syntax. It would never make sense to have a quantifier after an unescaped opening parenthesis, so it was safe to say (? now marks the beginning of a special group construct. But the question mark alone would only support one new feature, so the ? itself has to be followed by at least one more character to indicate which kind of group it is ((?:...), (?<!...), etc.). Again, the (?: is a single entity: the opening delimiter of a non-capturing group.
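Here is a minimal sketch of two of those group openers in Python's `re` module. The date and price strings are invented for the example; the point is only that the characters after (? select the kind of group.

```python
import re

date = "2022-11-28"

# Plain parentheses capture; the (?: opener groups without capturing.
capturing = re.match(r"(\d{4})-(\d{2})", date)
non_capturing = re.match(r"(?:\d{4})-(\d{2})", date)
print(capturing.groups())      # ('2022', '11')
print(non_capturing.groups())  # ('11',)

# The (?<! opener starts a negative lookbehind: match a number
# only when it is not immediately preceded by a dollar sign.
unpriced = re.findall(r"(?<!\$)\b\d+", "$10 and 20")
print(unpriced)  # ['20']
```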

I don't know offhand why he used the question mark both times. I do know Perl 6 Rules (a bottom-up rewrite of Perl 5 regexes) has done away with all that crap and uses an infinitely more sensible syntax.

Alan Moore answered Dec 05 '22 05:12
