
Regex is behaving lazy, should be greedy

I thought that by default my Regex would exhibit the greedy behavior that I want, but it does not in the following code:

    Regex keywords = new Regex(@"in|int|into|internal|interface");
    var targets = keywords.ToString().Split('|');
    foreach (string t in targets)
    {
        Match match = keywords.Match(t);
        Console.WriteLine("Matched {0,-9} with {1}", t, match.Value);
    }

Output:

Matched in        with in
Matched int       with in
Matched into      with in
Matched internal  with in
Matched interface with in

Now I realize that I could get it to work for this small example if I simply sorted the keywords by length descending, but

  • I want to understand why this isn't working as expected, and
  • the actual project I am working on has many more words in the Regex and it is important to keep them in alphabetical order.

So my question is: Why is this being lazy and how do I fix it?

asked Mar 07 '10 02:03 by Stomp


2 Answers

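Laziness and greediness apply to quantifiers only (?, *, +, {min,max}). Alternations are always tried in order, left to right, and the first alternative that matches wins. Every keyword here starts with "in", so the first alternative succeeds immediately and the longer ones are never tried.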
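
To make the distinction concrete, here is a minimal sketch (the class name AlternationOrderDemo and the sample inputs are made up for illustration). It contrasts the order of alternatives, which decides what an alternation matches, with the greedy and lazy forms of a quantifier:

```csharp
using System;
using System.Text.RegularExpressions;

class AlternationOrderDemo
{
    static void Main()
    {
        // Alternatives are tried left to right; "in" succeeds first, so it wins.
        Console.WriteLine(Regex.Match("interface", "in|int|into|internal|interface").Value); // in

        // Same alternatives, longest first: now the longest one is tried first.
        Console.WriteLine(Regex.Match("interface", "interface|internal|into|int|in").Value); // interface

        // Greedy vs. lazy only changes how much a quantifier consumes.
        Console.WriteLine(Regex.Match("aaaa", "a+").Value);  // aaaa (greedy)
        Console.WriteLine(Regex.Match("aaaa", "a+?").Value); // a (lazy)
    }
}
```

Reordering the alternatives changes the result, while no greedy or lazy modifier exists for the | operator itself.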

answered Oct 06 '22 01:10 by Max Shawabkeh


It looks like you're trying to match whole words. To do that, the entire expression has to match the whole word, and your current one does not require that. Try this one instead:

new Regex(@"\b(in|int|into|internal|interface)\b");

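The "\b" matches at a word boundary and is a zero-width match. In .NET a word boundary is a position between a word character (\w: letters, digits, underscore) and a non-word character (\W), or the start or end of the string next to a word character, so in practice it falls next to whitespace and punctuation. Being a zero-width match, it will not contain the character that caused the regex engine to detect the word boundary. With the trailing "\b" in place, a shorter alternative such as "in" fails whenever more word characters follow, so the engine backtracks and tries the next alternative until one ends exactly at the boundary.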
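
As a rough sketch of how this plays out with the loop from the question (the class name WordBoundaryDemo, the non-capturing group (?:...), and the extra test strings are my own additions), the keywords are kept in an alphabetical array and joined into the pattern, because splitting keywords.ToString() on '|' would no longer give clean words once the "\b(" and ")\b" are part of the pattern:

```csharp
using System;
using System.Text.RegularExpressions;

class WordBoundaryDemo
{
    static void Main()
    {
        // Keywords stay in alphabetical order; the boundaries force whole-word matches.
        string[] words = { "in", "int", "into", "internal", "interface" };
        var keywords = new Regex(@"\b(?:" + string.Join("|", words) + @")\b");

        foreach (string w in words)
        {
            Match m = keywords.Match(w);
            Console.WriteLine("Matched {0,-9} with {1}", w, m.Value);
        }

        // "integer" contains "in" and "int", but neither ends at a word boundary.
        Console.WriteLine(keywords.IsMatch("integer"));           // False

        // Inside a longer string, the first whole keyword is found.
        Console.WriteLine(keywords.Match("cast into int").Value); // into
    }
}
```

Each keyword in the loop now matches itself in full, since every shorter alternative is rejected by the trailing boundary check before the correct one is reached.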

answered Oct 06 '22 00:10 by Jason D