Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it better to use a non-greedy qualifier or a lookahead?

I have a possibly large block of text to search for instances of [[...]], where the ... can be anything, including other brackets (though they cannot be nested; the first instance of ]] after [[ ends the match).

I can think of two ways to match this text:

  • Using a non-greedy qualifier: /\[\[.+?\]\]/
  • Using a lookahead: /\[\[(?:(?!\]\]).)+\]\]/

Is one choice inherently better than the other, from a performance standpoint (I'd say the first is probably more readable)? I recall reading that it's better not to use non-greedy qualifiers, but I cannot find a source for that now.

like image 763
Daniel Vandersluis Avatar asked Jun 04 '10 16:06

Daniel Vandersluis


2 Answers

It is better to use a non-greedy quantifier in this case.

Take this example string "[[a]b]]"

Non-greedy quantifier

       \[\[.+?\]\]
Atom # 1 2 3  4 5
  1. Atom #1 \[ matches
  2. Atom #2 \[ matches
  3. Atom #3 .+? matches the "a"
  4. Atom #4 \] matches
  5. Atom #5 \] fails, back to #3 but keep string position
  6. Atom #3 .+? matches the "]"
  7. Atom #4 \] fails, back to #3 but keep string position
  8. Atom #3 .+? matches the "b"
  9. Atom #4 \] matches
  10. Atom #5 \] matches
  11. success

Look-ahead:

       \[\[(?:(?!\]\]).)+\]\]
Atom # 1 2 3  4       5  6 7
  1. Atom #1 \[ matches
  2. Atom #2 \[ matches
  3. Atom #4 (?!\]\]) succeeds (i.e. non-match) immediately at "a", go on
  4. Atom #5 . matches the "a", repeat at #3
  5. Atom #4 (?!\]\]) achieves partial match at "]"
  6. Atom #4 (?!\]\]) succeeds (i.e. non-match) at "b", go on
  7. Atom #5 . matches the "]", repeat at #3
  8. Atom #4 (?!\]\]) succeeds (i.e. non-match) immediately at "b", go on
  9. Atom #5 . matches the "b", repeat at #3
  10. Atom #4 (?!\]\]) achieves partial match at "]"
  11. Atom #4 (?!\]\]) achieves full match at "]", ergo: #4 fails, exit #3
  12. Atom #6 \] matches
  13. Atom #7 \] matches
  14. success

So it looks like the non-greedy quantifier has less work to do.

Disclaimer: This is an artificial example and real-life performance may differ, depending on the input, the actual expression and the implementation of the regex engine. I'm only 98% sure that what I outlined here is what is actually happening, so I'm open for corrections. Also, as with all performance tips, don't take this at face value, do your own benchmark comparisons if you want to know for sure.

like image 71
Tomalak Avatar answered Nov 15 '22 09:11

Tomalak


Another variant: /\[\[((?:\]?[^]])+)]]/

It uses neither non-greedy quantifiers nor look-aheads. It allows a single ] before any non-]. If there would be two ] in sequence, the inner repetition would stop, and and the match would end.

This pattern would be best to use with FSA-compiling regex engines. On back-tracking engines, it could get slower than the non-greedy variant.

like image 29
Markus Jarderot Avatar answered Nov 15 '22 08:11

Markus Jarderot