
Lazy quantifier {,}? not working as I would expect

I have an issue with lazy quantifiers. Or most likely I misunderstand how I am supposed to use them.

I am testing on Regex101. My test string is, let's say, 123456789D123456789

.{1,5} matches 12345

.{1,5}? matches 1

I am OK with both matches.

.{1,5}?D matches 56789D! I would expect it to match 9D.

Thanks for clarifying this.

A.D. asked Mar 11 '16 15:03



1 Answer

First and foremost, please do not think of greediness and laziness in regex as a means of getting the longest or shortest match. The terms "greedy" and "lazy" only pertain to the rightmost character a pattern can match; they have no impact on the leftmost one. A lazy quantifier guarantees that your matched substring ends at the first position where the rest of the pattern can match, not at the last one (which is what a greedy quantifier would return).

The regex engine analyzes a string from left to right. It tries each starting position in turn, and as soon as the whole pattern matches from some position, that substring is returned as the match. Later starting positions are never considered, even if they would give a shorter match.
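To see this on the test string from the question, here is a minimal sketch using Python's standard re module (the question used Regex101, so the choice of Python and the variable names are my assumptions; the lazy quantifier syntax is the same):

```python
import re

text = "123456789D123456789"

# Greedy: the dot takes as many characters as the quantifier allows.
greedy = re.search(r".{1,5}", text).group()

# Lazy with nothing after it: one character already satisfies the pattern.
lazy = re.search(r".{1,5}?", text).group()

# Lazy followed by D: the match still starts at the leftmost position
# from which the engine can reach a D within 5 characters.
lazy_then_d = re.search(r".{1,5}?D", text)

print(greedy)                                    # 12345
print(lazy)                                      # 1
print(lazy_then_d.group(), lazy_then_d.start())  # 56789D 4
```

The start offset 4 is the character 5, which is the first position from which a D is reachable within the allowed 1 to 5 characters.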

Let's see how it parses the string with .{1,5}?D: the engine starts at 1, the lazy dot matches 1, and D is tested for. There is no D after 1, so the engine expands the lazy quantifier to match 12 and tries to match D again. There is a 3 after 2, so the engine expands the lazy dot once more, and it keeps doing so until the dot has consumed 5 characters. After expanding to the maximum, it has matched 12345 and the next character is still not D. Since the quantifier has reached its upper limit, the match fails at this starting position, and the next position is tested.

The same thing happens when the attempt starts at 2, 3 and 4. When the engine starts at 5, it tries to match 5D, fails, tries 56D, fails, 567D, fails, 5678D - fails again, and when it tries to match 56789D - Bingo! - the match is found.
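The steps above can be sketched as a small hand-written simulation. This is not how a real regex engine is implemented; it only mimics the order of attempts for this one pattern shape, and the function name lazy_dots_then_d and its parameters are hypothetical. It also assumes the input has no newlines (which a dot would not match by default).

```python
def lazy_dots_then_d(text, lo=1, hi=5):
    """Mimic the attempt order of .{lo,hi}?D on text.

    Returns (start, matched_substring, attempts) or None if nothing matches.
    attempts lists every run of characters the lazy dot tried.
    """
    attempts = []
    for start in range(len(text)):          # left to right: leftmost start wins
        for n in range(lo, hi + 1):         # lazy: try the shortest run first
            end = start + n
            if end > len(text):             # not enough characters left
                break
            attempts.append(text[start:end])
            if text[end:end + 1] == "D":    # is the next character a D?
                return start, text[start:end + 1], attempts
    return None


start, matched, attempts = lazy_dots_then_d("123456789D123456789")
print(start, matched)    # 4 56789D
print(attempts[:5])      # ['1', '12', '123', '1234', '12345']
print(attempts[-5:])     # ['5', '56', '567', '5678', '56789']
```

The outer loop is what fixes the left edge of the match; the lazy quantifier only controls the order of the inner loop.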

This makes it clear that a lazily quantified subpattern at the beginning of a pattern can appear to act "greedily": the match is anchored at the leftmost position where it can succeed, so it will not necessarily be the shortest possible substring.


Now, here is a fun fact: .{1,5}? at the end of the pattern will always match exactly 1 character (if there is one), because the requirement is to match at least 1 and that is already sufficient to return a valid match. So, if you write D.{1,5}?, you will get D1 and D6 in 123456789D12345D678904.
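A quick check of that claim, again as a sketch with Python's re module (variable names are mine), with the greedy version alongside for contrast:

```python
import re

text = "123456789D12345D678904"

# Lazy quantifier at the end of the pattern: one character is enough.
trailing_lazy = re.findall(r"D.{1,5}?", text)

# Greedy quantifier at the end: takes up to five characters.
trailing_greedy = re.findall(r"D.{1,5}", text)

print(trailing_lazy)    # ['D1', 'D6']
print(trailing_greedy)  # ['D12345', 'D67890']
```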

Fun fact 2: In .NET, you can "ask" the regex engine to analyze the string from right to left with the help of the RegexOptions.RightToLeft option. Then, with .{1,5}?D, you will get 9D.
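Python's standard re module has no right-to-left mode, but the idea can be roughly imitated for this one pattern by reversing both the input and the pattern by hand. This is only an illustration of why scanning from the right yields 9D, not a model of .NET's RightToLeft behavior in general:

```python
import re

text = "123456789D123456789"

# .{1,5}?D read right to left becomes D.{1,5}? on the reversed string.
reversed_match = re.search(r"D.{1,5}?", text[::-1]).group()

# Flip the result back into the original reading direction.
right_to_left = reversed_match[::-1]

print(right_to_left)  # 9D
```

Scanning from the right, the D is found first and the lazy dot then needs only one character, which is the 9 just before the D in the original string.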

Fun fact 3: In .NET, (?<=(.{1,5}?))D will capture 9 into Group 1 if 123456789D is passed as input. This happens because of the way lookbehind is implemented in .NET regex: .NET reverses the string as well as the pattern inside the lookbehind, then attempts to match that single pattern on the reversed string. In Java, (?<=(.{1,5}))D (the greedy version) will also capture 9, because Java tries all the possible fixed-width patterns in the range, from the shortest to the longest, until one succeeds.

And a solution is: if you know you need 1 character followed by D, just use

/.D/
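For completeness, a minimal check of that pattern on the test string from the question, sketched with Python's re module (the slashes above are just regex delimiters and are not part of the pattern):

```python
import re

text = "123456789D123456789"

# Exactly one character followed by D.
one_char_then_d = re.search(r".D", text)

print(one_char_then_d.group(), one_char_then_d.start())  # 9D 8
```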
Wiktor Stribiżew answered Nov 04 '22 02:11
