Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Make a non-greedy RegEx in backward direction to behave the same like in forward direction

This pattern:

/a+?b+?/

Against the following string:

aaaaaabbbbbb

Matches:

aaaaaab

We see that the non-greedy behaves different in backward/left direction (takes all) and forward/right direction (takes just one).

Is there a way to make the non-greedy at the beginning, that matches all the a, to match as less as possible too? So that it behaves in the same way like at with the b part at the end?

like image 377
flori Avatar asked Mar 03 '13 21:03

flori


1 Answers

The short answer

Regexes generally match from left-to-right unless you set a right-to-left flag (which very few flavors support). In either case, they do not start in the middle and then work out in both directions, even if you use a lookbehind.

How do lazy quantifiers work?

It helps to stop and ask - why does the lazy quantifier exist in the first place? What problem was it meant to solve?

Normal (greedy) quantifiers work by finding a matching pattern of text and then repeatedly matching a sequence of characters until they can match no more. This behavior is usually desired, but you run into problems when you have a very general pattern followed by a very specific pattern where the specific pattern is a subset of the general pattern.

For example, consider the following input:

_abc_END_def_END

And this pattern:

(\w+END)

The intent was to match _abc_ and then END. The problem is that END is a subset of \w+. Using standard "greedy" rules, the \w+ matches as much as possible. So rather than matching _abc_, it matched _abc_END_def.

The solution to this scenario is to change the way the quantifier (+) behaves with the lazy modifier ?. By changing the expression to \w+?, the regex engine is forced to match only as much as necessary to satisfy the expression and no more. The expression is satisfied when \w+? matches _abc_ and END matches its literal string.

The purpose of the lazy quantifier is not to match a "minimum" number of characters - it is about giving that second pattern, a subset of the first, an opportunity to match.

Going back to your question

In your example, b is not a subset of a, so there is no need for the lazy quantifier. If you want to match one or more a's, but as few as possible, and one or more b's, but as few as possible, then you'd simply use:

ab

Or, if your a is a stand-in for some superset which may include b:

[ab]b

For example:

\wb

Both of which would match:

ab

Example:

const input = "aaabbb"

console.log(/ab/.exec(input)[0])
like image 133
JDB Avatar answered Sep 18 '22 09:09

JDB