Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is For loop on Regex Matches slow?

I have the following code:

        string pattern = @"(?:\S+\s){1,6}\S*" + search + @"\S*(?:\s\S+){1,6}";
        String dbContents = row[2].ToString();
        var matches = Regex.Matches(dbContents, pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
        for (int i = 0; i < matches.Count; i++)
        {
            if (i == 3)
                break;

            Contents += String.Format("... {0} ...", matches[i].Value);
        } 

What I'm trying to accomplish is to get one to six words before the search term and 1-6 words after the search term. When executing the code the performance hit on the for loop "matches.Count". With very large strings, its taking upwards of a min to execute. I'm confused on why and what to do to fix the issue.

like image 710
Chris Avatar asked Aug 25 '13 06:08

Chris


People also ask

Is Regex faster than for loop?

Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement.

Is Regex matching slow?

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

Does compiling regex make it faster?

I created a much simpler test that will show you that compiled regular expressions are unquestionably faster than not compiled. Here, the compiled regular expression is 35% faster than the not compiled regular expression.


1 Answers

In order to find the count, that has to find all the matches in order to count them. Given that you're going to stop after three anyway, that seems a little pointless.

Use MatchCollection's lazy evaluation in combination with the Take method from LINQ to only take the first three matches. Usually it's a good idea to use StringBuilder rather than string concatenation in a loop, too:

StringBuilder builder = new StringBuilder(...);
foreach (var match in matches.Cast<Match>().Take(3))
{
    builder.AppendFormat("... {0} ...", matches[i].Value);
}

(The StringBuilder change probably isn't going to make much difference here, but it's a good habit to get into. The Cast method is required because Enumerable.Take only works on the generic IEnumerable<T> type.)

like image 124
Jon Skeet Avatar answered Sep 24 '22 01:09

Jon Skeet