Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex lazy vs greedy confusion

Tags:

regex

I'm a little confused about regular expressions and greedy vs lazy. It's really very simple and it feels like I'm missing something obvious.

I've simplified my problem as much as I can to make it clear. Consider the following string and regex pattern.

string:
aaxxxb

pattern:
(?<=a)(.*?)(?=b)

result:
axxx

what I expected:
xxx

This result is what I would expect from using .* instead of .*?, what am I missing?

Obviously, same thing if I use a.*?b gives me aaxxxb. Why is this? Shouldn't lazy (like .*?) return as few characters as possible?

like image 223
user1277327 Avatar asked Apr 19 '14 22:04

user1277327


1 Answers

You are missing the fact that a regex engine works from left to right, position by position, and succeeds as soon as it finds a match at the current position.

In your example, the first position where the pattern succeeds is at the second "a".

The laziness works only on the right side.

If you want to obtain "xxx", a better way is to use a negated character class [^ab]* instead of .*?

Note: not exactly related to the subject, but good to know: a DFA regex engine will try to get the largest result in case of alternation, a NFA gives you the first that succeeds.

like image 81
Casimir et Hippolyte Avatar answered Sep 22 '22 05:09

Casimir et Hippolyte