Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression greedy match not working as expected

I have a very basic regular expression that I just can't figure out why it's not working so the question is two parts. Why does my current version not work and what is the correct expression.

Rules are pretty simple:

  1. Must have minimum 3 characters.
  2. If a % character is the first character must be a minimum of 4 characters.

So the following cases should work out as follows:

  • AB - fail
  • ABC - pass
  • ABCDEFG - pass
  • % - fail
  • %AB - fail
  • %ABC - pass
  • %ABCDEFG - pass
  • %%AB - pass

The expression I am using is:

^%?\S{3}

Which to me means:

  • ^ - Start of string
  • %? - Greedy check for 0 or 1 % character
  • \S{3} - 3 other characters that are not white space

The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?

Someone please show me the light :)

Edit: The answer I used was Dav below: ^(%\S{3}|[^%\s]\S{2}) Although it was a 2 part answer and Alan's really made me understand why. I didn't use his version of ^(?>%?)\S{3} because it worked but not in the javascript implementation. Both great answers and a lot of help.

like image 627
Kelsey Avatar asked Dec 02 '22 07:12

Kelsey


2 Answers

The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.

Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:

^(?>%?)\S{3}

If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.

Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.

like image 160
Alan Moore Avatar answered Dec 22 '22 08:12

Alan Moore


Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".

Instead, what you probably want is something like this:

^(%\S{3}|[^%\s]\S{2})

Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.

like image 35
Amber Avatar answered Dec 22 '22 07:12

Amber