Regexp lazy quantifier

Tags:

regex

I have a sentence like this

a something* q b c w

and I have to match a and q together, like

(id_1: a, id_2: q)

b alone like

(id_1: b)

and c and w together like (id_1:c id_2:w)

I tried to use this regexp

(?:\b(?P<id_1>a|b|c)\b(?:.*?)(?P<id_2>q|w)?\b)

Because of the lazy operator .*? the regexp matches only the first part of the sentence, matching only

(id_1: a, id_1: b, id_1: c)

Live Example

If we use a greedy operator such that the expression becomes

(?:\b(?P<id_1>a|b|c)\b(?:.*)(?P<id_2>q|w)?\b)

Live Example

It matches

(id_1: a)

and everything after it is consumed by .* .

If the second part is mandatory (with lazy on .* ):

(?:\b(?P<id_1>a|b|c)\b(?:.*?)(?P<id_2>q|w)\b)

Live Example

it matches sentences like

(id_1: a, id_2: q);(id_1: b, id_2: w)

as expected.
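
The three behaviours described above can be reproduced outside the live examples. This is a minimal Python sketch, assuming the standard re module; the names SENTENCE, PATTERNS and groups are illustrative only and do not come from the original post.

```python
import re

SENTENCE = "a something* q b c w"

# The three variants from the question, keyed by what distinguishes them.
PATTERNS = {
    "lazy, optional id_2": r"(?:\b(?P<id_1>a|b|c)\b(?:.*?)(?P<id_2>q|w)?\b)",
    "greedy, optional id_2": r"(?:\b(?P<id_1>a|b|c)\b(?:.*)(?P<id_2>q|w)?\b)",
    "lazy, mandatory id_2": r"(?:\b(?P<id_1>a|b|c)\b(?:.*?)(?P<id_2>q|w)\b)",
}


def groups(pattern, text):
    """Return (id_1, id_2) for every match of pattern in text."""
    return [(m.group("id_1"), m.group("id_2")) for m in re.finditer(pattern, text)]


for label, pattern in PATTERNS.items():
    print(f"{label}: {groups(pattern, SENTENCE)}")

# lazy, optional id_2: [('a', None), ('b', None), ('c', None)]
# greedy, optional id_2: [('a', None)]
# lazy, mandatory id_2: [('a', 'q'), ('b', 'w')]
```

In the greedy case the single match spans the whole sentence, and id_2 stays empty because .* has already consumed q and w.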

Is it possible to write a regular expression that "prefers" matching the whole sentence (including the optional part), and that matches only the first part ONLY if the optional one is missing?

EDIT: Sorry the regexes provided had some errors in them.

The last regex is:

(?:\b(?P<id_1>a|b|c)\b(?:.*?)(?P<id_2>q|w)\b)

and it requires both groups to be present. It matches "a something* w" but it doesn't match "a something*" or just "a". I need to match "a something* w" as well as "a" and "a w", and get the matching groups respectively:

(id_1: a , id_2: w) ; (id_1: a, id_2: none) ; (id_1:a , id_2: w)

I think that the regex required is:

(?:\b(?P<id_1>a|b|c)\b(?:.*?)(?P<id_2>q|w)?\b)

but in the sentence "a something* w" it just matches "a" (due to the lazy operator on .*).

I have also updated all the live examples.

Desh901 asked Nov 09 '22 00:11


1 Answer

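Lazy dot matching is the root cause of the problem: .*? only expands when what follows it fails to match, so it needs a mandatory trailing boundary to push it forward. With id_2 optional, the final \b already matches right after id_1, and .*? never grows.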

If you need to match any text up to, but not including, some specific text, you have two options: a tempered greedy token or an unroll-the-loop based regex.

If you have variables, you can use a tempered greedy token and make the second capture group optional with the ? quantifier:

\b(?P<id_1>a|b|c)\b(?:(?!\b(?:a|b|c|q|w)\b).)*(?P<id_2>q|w)?\b
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^             ^
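
As a quick check, the pattern can be run against the sentence from the question and the cases listed in its EDIT. This is a minimal sketch assuming Python's re module; TEMPERED and pairs are illustrative names, not part of the answer.

```python
import re

# Pattern from the answer. The tempered greedy token consumes one character at
# a time, but only while no standalone a, b, c, q or w starts at that position.
TEMPERED = re.compile(
    r"\b(?P<id_1>a|b|c)\b(?:(?!\b(?:a|b|c|q|w)\b).)*(?P<id_2>q|w)?\b"
)


def pairs(text):
    """Return (id_1, id_2) for every match; id_2 is None when it is absent."""
    return [(m.group("id_1"), m.group("id_2")) for m in TEMPERED.finditer(text)]


print(pairs("a something* q b c w"))  # [('a', 'q'), ('b', None), ('c', 'w')]

# Cases from the EDIT in the question.
print(pairs("a something* w"))  # [('a', 'w')]
print(pairs("a"))               # [('a', None)]
print(pairs("a w"))             # [('a', 'w')]
```

Because the token refuses to step over a standalone keyword, each match stops right before the next a, b, c, q or w, which leaves id_2 free to capture q or w when one follows.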

See regex demo
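
The unroll-the-loop alternative is named above but not spelled out. The sketch below is one possible unrolled form, written here as an assumption rather than taken from the answer: it consumes runs of characters that cannot start a keyword in one step and only applies the lookahead at the letters a, b, c, q and w. It relies on the keywords being single letters, and it excludes \n from the character class so that it stops at line breaks the same way . does. UNROLLED and spans are hypothetical names.

```python
import re

TEMPERED = re.compile(
    r"\b(?P<id_1>a|b|c)\b(?:(?!\b(?:a|b|c|q|w)\b).)*(?P<id_2>q|w)?\b"
)

# Assumed unrolled equivalent (not from the answer):
#   [^abcqw\n]*                 run of characters that can never start a keyword
#   (?!\b[abcqw]\b)[abcqw]      a keyword letter that is part of a longer word
UNROLLED = re.compile(
    r"\b(?P<id_1>a|b|c)\b"
    r"[^abcqw\n]*(?:(?!\b[abcqw]\b)[abcqw][^abcqw\n]*)*"
    r"(?P<id_2>q|w)?\b"
)


def spans(pattern, text):
    """Return (full match, id_1, id_2) for every match of pattern in text."""
    return [
        (m.group(0), m.group("id_1"), m.group("id_2"))
        for m in pattern.finditer(text)
    ]


SAMPLES = ["a something* q b c w", "a something*", "a was q", "a"]
for text in SAMPLES:
    same = spans(TEMPERED, text) == spans(UNROLLED, text)
    print(f"{text!r}: same result = {same}")
```

The unrolled form usually needs fewer lookahead checks on long inputs, at the cost of being harder to adapt when the keywords are longer than one character.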

Wiktor Stribiżew answered Dec 22 '22 15:12