Run multiple RegEx patterns on single string

Tags: c#, regex

I need to run a C# RegEx match on a string. Problem is, I'm looking for more than one pattern on a single string, and I cannot find a way to do that with a single run.

For example, in the string

The dog has jumped

I'm looking for "dog" and for "dog has".

I don't know how I can get both results in one pass.

I've tried to combine the patterns with the alternation symbol (|), like this:

(dog|dog has)

But it returned only the first match.

What can I use to get back both the matches?

Thanks!

ml123 asked Dec 27 '14 21:12



2 Answers

The regex engine returns the first substring that satisfies the pattern. If you write (dog|dog has), it will never match dog has, because dog has starts with dog, which is the first alternative: alternatives are tried left to right, and the first one that succeeds wins. Furthermore, the regex engine won't return overlapping matches.
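
To see this behaviour in isolation, here is a minimal sketch using the string from the question. MatchValues is a hypothetical helper name introduced only for this illustration, and the snippet assumes a compiler that accepts top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of any library): returns the text of every match.
string[] MatchValues(string text, string pattern) =>
    Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();

var input = "The dog has jumped";

// "dog" is the first alternative and succeeds, so "dog has" is never tried at that position.
Console.WriteLine(string.Join(", ", MatchValues(input, "dog|dog has")));  // dog

// With the longer alternative first, "dog has" wins instead,
// but "dog" is still not reported separately because matches do not overlap.
Console.WriteLine(string.Join(", ", MatchValues(input, "dog has|dog")));  // dog has
```

Either way you get one match per position, so reordering the alternatives alone does not return both results.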

Here's a convoluted method:

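using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

var patterns = new[] { "dog", "dog has" };

// Wrap each pattern in an optional lookahead with its own named group (p0, p1, ...).
var sb = new StringBuilder();
for (var i = 0; i < patterns.Length; i++)
    sb.Append(@"(?=(?<p").Append(i).Append(">").Append(patterns[i]).Append("))?");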

var regex = new Regex(sb.ToString(), RegexOptions.Compiled);
Console.WriteLine("Pattern: {0}", regex);

var input = "a dog has been seen with another dog";
Console.WriteLine("Input: {0}", input);

foreach (var match in regex.Matches(input).Cast<Match>())
{
    for (var i = 0; i < patterns.Length; i++)
    {
        var group = match.Groups["p" + i];
        if (!group.Success)
            continue;

        Console.WriteLine("Matched pattern #{0}: '{1}' at index {2}", i, group.Value, group.Index);
    }
}

This produces the following output:

Pattern: (?=(?<p0>dog))?(?=(?<p1>dog has))?
Input: a dog has been seen with another dog
Matched pattern #0: 'dog' at index 2
Matched pattern #1: 'dog has' at index 2
Matched pattern #0: 'dog' at index 33

Yes, this is an abuse of the regex engine :)

This works by building a pattern out of optional lookaheads, which capture the substrings as a side effect, while the pattern itself always matches an empty string. So there are n+1 total matches, n being the input length: one empty match at each position, including the end of the string. The patterns cannot contain numbered backreferences, because the group numbers shift once the patterns are concatenated, but you can use named backreferences instead.

Also, this can return overlapping matches, as it will try to match all patterns at all string positions.

But you definitely should benchmark this against a manual approach (looping over the patterns and matching each of them separately). I don't expect this to be fast...
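
For comparison, the manual approach could look roughly like this. It is only a sketch: FindAll is a hypothetical helper name, and it assumes a compiler with tuple support and top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: runs each pattern on its own and records
// (pattern index, matched text, position) for every hit.
List<(int Pattern, string Value, int Index)> FindAll(string text, string[] patterns)
{
    var hits = new List<(int Pattern, string Value, int Index)>();
    for (var i = 0; i < patterns.Length; i++)
        foreach (Match m in Regex.Matches(text, patterns[i]))
            hits.Add((i, m.Value, m.Index));
    return hits;
}

var patterns = new[] { "dog", "dog has" };
var input = "a dog has been seen with another dog";

foreach (var hit in FindAll(input, patterns))
    Console.WriteLine("Matched pattern #{0}: '{1}' at index {2}", hit.Pattern, hit.Value, hit.Index);
```

The hits come back grouped by pattern rather than by position, so sort them by Index if order matters. Also note that each pattern on its own still won't overlap with itself (for example, "aa" is found only once in "aaa"), unlike the lookahead version, which tries every position.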

Lucas Trzesniewski answered Nov 15 '22 06:11



You can use one regex pattern to do both.

Pattern: (dog\b has\b)|(dog\b)

I figured out this pattern using an online regex builder.

Then you can use it in C# with the Regex class by doing something like:

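string input = "The dog has jumped";
// Verbatim string (@"...") so that \b is a regex word boundary, not a backspace character.
Regex reg = new Regex(@"(dog\b has\b)|(dog\b)", RegexOptions.IgnoreCase);
if (reg.IsMatch(input)) {
  // found "dog" or "dog has"
}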
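
If you need to know which alternative matched, you can inspect the capture groups. A rough sketch follows; Describe is a hypothetical helper name and the input string is made up for the illustration (top-level statements, C# 9 or later, assumed).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: group 1 is the "dog has" alternative, group 2 is plain "dog".
string Describe(Match match) =>
    match.Groups[1].Success ? "dog has" : match.Groups[2].Success ? "dog" : "none";

var reg = new Regex(@"(dog\b has\b)|(dog\b)", RegexOptions.IgnoreCase);

foreach (Match m in reg.Matches("The dog has jumped over another dog"))
    Console.WriteLine("'{0}' at index {1} (alternative: {2})", m.Value, m.Index, Describe(m));
```

Keep in mind that each position still produces a single match: where the input contains "dog has", only group 1 succeeds, and the shorter "dog" is not reported again for the same spot.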
Ryan Mann answered Nov 15 '22 06:11
