For the common problem of matching text between delimiters (e.g. < and >), there are two common patterns:

using the greedy * or + quantifier in the form START [^END]* END, e.g. <[^>]*>, or
using the lazy *? or +? quantifier in the form START .*? END, e.g. <.*?>.

Is there a particular reason to favour one over the other?
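As a minimal sketch of the two forms described in the question, here they are in Python's re module. The sample string is made up for illustration; on simple input like this, both patterns find the same matches.

```python
import re

# Hypothetical sample input with two delimited spans.
text = "a <b> c <d e> f"

# Greedy quantifier on a negated character class: START [^END]* END
greedy_negated = re.findall(r"<[^>]*>", text)

# Lazy quantifier on the dot: START .*? END
lazy_dot = re.findall(r"<.*?>", text)

print(greedy_negated)  # ['<b>', '<d e>']
print(lazy_dot)        # ['<b>', '<d e>']
```

The differences only show up on less tidy input, such as text containing newlines or a pattern that continues after the closing delimiter.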

Matching text between delimiters: greedy or lazy regular expression?

2 Answers

Some advantages of each:

[^>]*:

More expressive: the pattern states exactly which characters may appear between the delimiters.
Matches newlines regardless of the /s flag.
Usually considered quicker, because the engine doesn't have to backtrack to find a successful match: with [^>] the engine makes no choices, since we give it only one way to match the pattern against the string.
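The newline point can be checked with a short sketch in Python, where the equivalent of the /s flag is re.DOTALL. The sample string is an assumption chosen for illustration.

```python
import re

# Hypothetical input where the delimited text spans two lines.
text = "<a\nb> tail"

# The negated class matches any character except '>', including '\n'.
negated = re.search(r"<[^>]*>", text)

# By default '.' does not match '\n', so the lazy form finds nothing here.
lazy_default = re.search(r"<.*?>", text)

# With re.DOTALL (Python's counterpart of /s), '.' also matches '\n'.
lazy_dotall = re.search(r"<.*?>", text, flags=re.DOTALL)

print(repr(negated.group(0)))      # '<a\nb>'
print(lazy_default)                # None
print(repr(lazy_dotall.group(0)))  # '<a\nb>'
```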

.*?:

No "code duplication": the end character appears only once.
Simpler when the end delimiter is more than one character long, where a single character class would not work. A common alternative is (?:(?!END).)*, which gets even more unwieldy if the END delimiter is itself a pattern.
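To illustrate the multi-character case, here is a rough sketch in Python that assumes the delimiters are [[ and ]] (chosen only for this example). It compares the lazy form, the (?:(?!END).)* alternative, and a single-character negated class.

```python
import re

# Hypothetical input: the first span contains a lone ']' inside it.
text = "x [[a]b]] y [[c]] z"

# Lazy dot: the two-character end delimiter is written once.
lazy = re.findall(r"\[\[(.*?)\]\]", text)

# Alternative: consume one character at a time, as long as the
# end delimiter does not start at the current position.
tempered = re.findall(r"\[\[((?:(?!\]\]).)*)\]\]", text)

# A negated class can only exclude single characters, so it refuses
# to cross the lone ']' and misses the first span entirely.
char_class = re.findall(r"\[\[([^\]]*)\]\]", text)

print(lazy)        # ['a]b', 'c']
print(tempered)    # ['a]b', 'c']
print(char_class)  # ['c']
```

The second pattern gives the same result as the lazy one here, but it repeats the end delimiter and is harder to read, which is the trade-off described above.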

178

answered Oct 02 '22 15:10

Kobi

The first is more explicit, i.e. it definitely excludes the closing delimiter from being part of the matched text. This is not guaranteed in the second case once the regular expression is extended to match more than just this tag: a lazy .*? will still expand past a > if that is the only way for the rest of the pattern to match.

Example: If you try to match <tag1><tag2>Hello! with <.*?>Hello!, the regex will match

<tag1><tag2>Hello!

whereas <[^>]*>Hello! will match

<tag2>Hello!
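The example above can be reproduced with a few lines of Python, given here as a sketch using the same input and patterns:

```python
import re

text = "<tag1><tag2>Hello!"

# Lazy dot: after "<tag1>" fails to be followed by "Hello!", the engine
# keeps extending .*? past the first '>' until the whole pattern matches.
lazy = re.search(r"<.*?>Hello!", text)

# Negated class: the match can never cross a '>', so the attempt starting
# at "<tag1>" fails and the engine moves on to "<tag2>".
negated = re.search(r"<[^>]*>Hello!", text)

print(lazy.group(0))     # <tag1><tag2>Hello!
print(negated.group(0))  # <tag2>Hello!
```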

answered Oct 02 '22 14:10

Tim Pietzcker

Tags: language-agnostic, regex, regex-greedy, greedy

Heinzi
