Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex ambiguity

Tags:

regex

Basic question about regex mechanics:

I have the following expression: [10]*1[10]*.

Would this match 100?

My reasoning:
first option: [10]* matches "100" and then gets to the end of the string => no match.
second option: The [10]* is disregarded and the expression matches.

Am I forgetting something trivial, or will this actually depend on the regex engine?
(I remember something about greedy vs not greedy but I am not sure if that applies to this case)

like image 783
user472875 Avatar asked Oct 25 '12 19:10

user472875


2 Answers

Regex engines do backtracking.

The engine tries to match 100 with [10]*, but that does not work, because then the 1 has nothing to match. But then the engine throws away the last character of the repetition (only using [10]* for 10) and tries again. Still does not work, because 1 does not match 0. The engine will throw away one character at a time, until the first [10*] is completely dropped. Now the 1 matches and [10]* gladly matches the rest.

I recommend reading through this tutorial, because it explains very well what is going on under the hood. (For you particular problem, check out the section on Repetition).

Some more detail:

This does not depend on whether the repetition is greedy or ungreedy. The regex engine will always backtrack. It will just start from the other end (with 0 occurrences of [10]) if you make it ungreedy like this: [10]*?. In this case that would speed up the process, because the first try would already match, but it would not change the fact, that it always matches.

In fact, you can manually prevent the engine from backtracking, by making the repetition "possessive". If you do this, and the repetition is first left, then the engine will not try other possible repetitions. This would be the syntax: [10]*+. Now the engine would match 100 only with that first part. Then matching 1 would fail, but since you made the repetition possessive, it would not go back to try different options to use [10]*. In this case that is useless, of course, but there are use cases where this behavior is desirable. And all of this is also covered in the linked tutorial. ;)

like image 160
Martin Ender Avatar answered Oct 12 '22 08:10

Martin Ender


The answer is, yes it matches, because the regex parser will consume as much from each sub-expression as it needs to achieve a match on the whole expression.

In your case, to match it will do this:

  • the first [10]* will consume zero chars
  • then it will match the literal 1
  • then the last [10]* will consume the remaining input


Finally, instead of asking here, why not try it out on regexpal and see for yourself!

like image 35
Bohemian Avatar answered Oct 12 '22 07:10

Bohemian