Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

At which position does the regex fail?

I need a very simple string validator that would show where is first symbol not corresponding to the desired format. I want to use regex but in this case I have to find the place where the string stops corresponding to the expression and I can't find a method that would do that. (It's got to be a fairly simple method... maybe there isn't one?)

For example if I have regex:

/^Q+E+R+$/

with string:

"QQQQEEE2ER"

The desired result should be 7

like image 671
Morozov Avatar asked Jun 02 '14 18:06

Morozov


2 Answers

An idea: what you can do is to tokenize your pattern and write it with optional nested capturing groups:

^(Q+(E+(R+($)?)?)?)?

Then you only need to count the number of capture groups you obtain to know where the regex engine stops in the pattern and you can determine the offset of the match end in the string with the whole match length.

As @zx81 notices it in his comment, if one of the elements can match the next element (example Q can match the element E), things become different.

Let's say that Q is \w (and can match E and R). For the string QQQEEERRR the precedent pattern will give only one capturing group (the greedy \w+ matches all) when ^(\w+)(E+)(R+)$ will give three groups: QQQEE, E, RRR

To obtain the same result you need to add an alternation:

^((?:\w+(?=E)|\w+)(E+(R+($)?)?)?)?

In the alternation, the case where E exists must be tested first, and only if this branch fails (with the lookahead), then the other branch where E doesn't exist is used.

Thus the full pattern can be rewritten like this to deal with this specific case:

^((?:Q+(?=E)|Q+)((?:E+(?=R)|E+)((?:R+(?=$)|R+)($)?)?)?)?

Perhaps could you take a look to the gem amatch too.

like image 100
Casimir et Hippolyte Avatar answered Sep 28 '22 05:09

Casimir et Hippolyte


This is an interesting task that can be accomplished with a neat regex trick:

^(?:(?=(Q+)))?(?:(?=(Q+E+)))?(?:(?=(Q+E+R+)))?(?:(?=(Q+E+R+$)))?

We have four optional lookaheads checking various parts of the pattern and capturing the partial matches to Groups 1, 2, 3 and 4 incrementally.

  1. Group 1 contains Q+ if it can be matched, in your example QQQQ.
  2. Group 2 contains Q+E+ if it can be matched, in your example EEE.
  3. Group 3 contains Q+E+R+ if it can be matched, in your example nil.
  4. Group 3 contains Q+E+R+$ if it can be matched, in your example nil.

In your code, check which is the last Group that is set by testing !$1.nil?, !$2.nil? and so on.

The last one set gives you the length that is matchable, so in your example $2.length gives you the 7 you wanted.

Incidentally, the fact that Group 2 is the last one set also tells you that we fail on R+.

like image 26
zx81 Avatar answered Sep 28 '22 05:09

zx81