
Matching text between delimiters: greedy or lazy regular expression?

For the common problem of matching text between delimiters (e.g. < and >), there are two common patterns:

  • using the greedy * or + quantifier in the form START [^END]* END, e.g. <[^>]*>, or
  • using the lazy *? or +? quantifier in the form START .*? END, e.g. <.*?>.

Is there a particular reason to favour one over the other?
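To make the two forms concrete, here is a minimal sketch using Python's `re` module; the sample string is made up for illustration. On simple input both forms return the same matches, while a plain greedy `.*` overshoots to the last `>`.

```python
import re

# Hypothetical sample input with two delimited spans.
text = "a <b> c <i> d"

# Greedy quantifier on a negated character class: START [^END]* END
greedy_negated = re.findall(r"<[^>]*>", text)

# Lazy quantifier on the dot: START .*? END
lazy_dot = re.findall(r"<.*?>", text)

# For contrast: a plain greedy dot runs on to the last '>'.
greedy_dot = re.findall(r"<.*>", text)

print(greedy_negated)  # ['<b>', '<i>']
print(lazy_dot)        # ['<b>', '<i>']
print(greedy_dot)      # ['<b> c <i>']
```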

Heinzi asked Aug 29 '11 08:08



2 Answers

Some advantages:

[^>]*:

  • More expressive.
  • Matches newlines regardless of the /s (dot-all) flag.
  • Considered quicker, because the engine doesn't have to backtrack to find a successful match (with [^>] the engine doesn't make choices - we give it only one way to match the pattern against the string).

.*?:

  • No "code duplication" - the end character only appears once.
  • Simpler when the end delimiter is more than one character long, where a negated character class does not work. A common alternative in that case is (?:(?!END).)*, which gets even worse if the END delimiter is itself another pattern.
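Two of the points above (newline handling and multi-character end delimiters) can be checked with a short sketch in Python's `re` module. The sample strings and the HTML-comment delimiter are assumptions chosen for illustration, and `re.DOTALL` is Python's equivalent of the /s flag.

```python
import re

# Newlines: [^>] matches them, the dot does not unless DOTALL is set.
multiline = "<a\nb>"
negated = re.findall(r"<[^>]*>", multiline)
lazy_plain = re.findall(r"<.*?>", multiline)
lazy_dotall = re.findall(r"<.*?>", multiline, re.DOTALL)

print(negated)      # ['<a\nb>']
print(lazy_plain)   # []
print(lazy_dotall)  # ['<a\nb>']

# Multi-character end delimiter "-->" (hypothetical example input).
comments = "x <!-- a - b --> y <!-- c --> z"

# A negated class can only exclude single characters, so it rejects
# the first comment because of the lone '-' inside it.
char_class = re.findall(r"<!--[^-]*-->", comments)

# The lazy form and the (?:(?!END).)* form both handle it.
lazy = re.findall(r"<!--.*?-->", comments)
tempered = re.findall(r"<!--(?:(?!-->).)*-->", comments)

print(char_class)  # ['<!-- c -->']
print(lazy)        # ['<!-- a - b -->', '<!-- c -->']
print(tempered)    # ['<!-- a - b -->', '<!-- c -->']
```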
Kobi answered Oct 02 '22 15:10


The first is more explicit, i.e. it definitely excludes the closing delimiter from being part of the matched text. This is not guaranteed in the second case: if the regular expression is extended to match more than just this tag, the lazy quantifier can expand past a closing delimiter so that the rest of the pattern can match.

Example: If you try to match <tag1><tag2>Hello! with <.*?>Hello!, the regex will match

<tag1><tag2>Hello!

whereas <[^>]*>Hello! will match

<tag2>Hello!
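The example above can be reproduced with a small sketch in Python's `re` module (the variable names are illustrative only):

```python
import re

text = "<tag1><tag2>Hello!"

# Lazy dot: the match starts at the first '<' and .*? keeps expanding,
# across '>' characters, until 'Hello!' can follow.
lazy_match = re.search(r"<.*?>Hello!", text).group(0)

# Negated class: [^>]* can never cross a '>', so the attempt at the
# first '<' fails and the match starts at '<tag2>' instead.
negated_match = re.search(r"<[^>]*>Hello!", text).group(0)

print(lazy_match)     # <tag1><tag2>Hello!
print(negated_match)  # <tag2>Hello!
```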
Tim Pietzcker answered Oct 02 '22 14:10