
Which would be better non-greedy regex or negated character class?

I need to match @anything_here@ in the string @anything_here@dhhhd@shdjhjs@, so I used the following regex:

^@.*?@

or

^@[^@]*@

Both work, but I would like to know which is the better solution: a regex with non-greedy repetition or a regex with a negated character class?
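For reference, here is a quick check that both patterns return the same match on the sample string. This is only a sketch: the question does not name a regex engine, so Python's re module is an assumption, and the variable names are illustrative.

```python
import re

# Sample string from the question.
text = "@anything_here@dhhhd@shdjhjs@"

# Lazy dot version and negated character class version.
lazy = re.match(r"^@.*?@", text)
negated = re.match(r"^@[^@]*@", text)

print(lazy.group())     # @anything_here@
print(negated.group())  # @anything_here@
```

The two differ in how the engine gets to that result, not in the result itself.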

Pranav C Balan asked Dec 21 '16 18:12



2 Answers

The ^@[^@]*@ option is clearly the better one.

The negated character class is quantified greedily, which means the regex engine grabs zero or more characters other than @ right away, as many as possible, and only then looks for the closing @.

When you use a lazy dot matching pattern, the engine matches the leading @, then immediately tries to match the trailing @ (skipping the .*? at first). It does not find @ at index 1, so the .*? expands to consume the a character, and the engine tries the trailing @ again. The .*? expands this way once for every character other than @ before the closing @.


Wiktor Stribiżew answered Oct 19 '22 11:10


Negated character classes should usually be preferred over lazy matching, if possible.

If the regex is successful, ^@[^@]*@ can match the content between @s in a single step, while ^@.*?@ needs to expand for each character between @s.
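To make the difference concrete, here is a toy model in Python. It does not instrument a real regex engine; it only mimics how often each pattern has to try the closing @ on a successful match. Both function names are hypothetical, and the model assumes the input starts with @ and contains a closing @.

```python
def lazy_closing_attempts(text: str) -> int:
    """Toy model of ^@.*?@: count attempts to match the closing '@'."""
    assert text.startswith("@")
    attempts = 0
    pos = 1  # position right after the leading '@'
    while pos < len(text):
        attempts += 1      # the engine tries the closing '@' at pos
        if text[pos] == "@":
            break          # closing '@' found, the match succeeds
        pos += 1           # '.*?' expands by one character
    return attempts


def negated_closing_attempts(text: str) -> int:
    """Toy model of ^@[^@]*@ on a successful match (assumed input)."""
    assert text.startswith("@")
    pos = 1
    while pos < len(text) and text[pos] != "@":
        pos += 1           # one greedy scan over the non-'@' characters
    return 1               # the closing '@' is tried once, where the scan stopped


sample = "@anything_here@dhhhd@shdjhjs@"
print(lazy_closing_attempts(sample))     # 14
print(negated_closing_attempts(sample))  # 1
```

Real engines count steps differently, so treat the numbers as an illustration of the shape of the work, not as a benchmark.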

When the match fails (for example, because there is no ending @), some regex engines (PCRE, for example) apply a little magic and internally treat [^@]* as the possessive [^@]*+, since there is a clear-cut border between @ and non-@. The class then matches to the end of the string, the engine recognizes the missing @, and it fails instantly without backtracking. .*? still expands character by character as usual.

When used in larger patterns, [^@]* can also never run past the closing @, while this is very well possible with lazy matching. E.g. ^@[^@]*a[^@]*@ won't match @bbbb@a@ while ^@.*?a.*?@ will.
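The example above can be checked with a short Python sketch (Python's re module is my choice here; the variable names are illustrative):

```python
import re

text = "@bbbb@a@"

# The negated class cannot cross the '@' after "bbbb", so no 'a' is reachable.
negated_match = re.search(r"^@[^@]*a[^@]*@", text)

# The lazy dot is free to expand across that '@' to reach the 'a'.
lazy_match = re.search(r"^@.*?a.*?@", text)

print(negated_match)       # None
print(lazy_match.group())  # @bbbb@a@
```

Here the lazy pattern matches across two @-delimited sections, which is usually not what was intended.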

Note that [^@] will also match newlines, while . doesn't (in most regex engines, and unless used in singleline mode). If that is not wanted, you can avoid it by adding the newline character to the negated class, as in [^@\n].
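A small Python sketch of the newline behaviour, assuming Python's re module, where singleline mode is spelled re.DOTALL (the sample text and variable names are made up for illustration):

```python
import re

text = "@line one\nline two@"

# The negated class happily crosses the newline.
negated = re.match(r"^@[^@]*@", text)

# The dot stops at the newline by default, so the lazy version fails here.
lazy = re.match(r"^@.*?@", text)

# With DOTALL the dot matches newlines too.
lazy_dotall = re.match(r"^@.*?@", text, re.DOTALL)

# Adding \n to the negated class restores the "no newlines" behaviour.
negated_no_newline = re.match(r"^@[^@\n]*@", text)

print(negated is not None)      # True
print(lazy)                     # None
print(lazy_dotall is not None)  # True
print(negated_no_newline)       # None
```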

Sebastian Proske answered Oct 19 '22 11:10