Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Increasing Regex Efficiency

Tags:

c#

regex

I have about 100k Outlook mail items that have about 500-600 chars per Body. I have a list of 580 keywords that must search through each body, then append the words at the bottom.

I believe I've increased the efficiency of the majority of the function, but it still takes a lot of time. Even for 100 emails it takes about 4 seconds.

I run two functions for each keyword list (290 keywords each list).

       public List<string> Keyword_Search(HtmlNode nSearch)
    {
        var wordFound = new List<string>();
        foreach (string currWord in _keywordList)
        {
            bool isMatch = Regex.IsMatch(nSearch.InnerHtml, "\\b" + @currWord + "\\b",
                                                  RegexOptions.IgnoreCase);
            if (isMatch)
            {
                wordFound.Add(currWord);
            }
        }
        return wordFound;
    }

Is there anyway I can increase the efficiency of this function?

The other thing that might be slowing it down is that I use HTML Agility Pack to navigate through some nodes and pull out the body (nSearch.InnerHtml). The _keywordList is a List item, and not an array.

like image 743
cam Avatar asked Mar 30 '10 13:03

cam


People also ask

How do I create more efficient regular expressions?

Learn how to create more efficient regular expressions. Regular expressions are much slower than Data Identifiers, so use a Data Identifier whenever possible. If a Data Identifier does not fit the needs of a particular policy, regular expressions are still available. ? Escape; Example \. \* \+ \? (?: ) Use PCRE Compatible regex syntax.

How to check if regex is good performance?

Benchmarks may assure that regex has good performance. However, it’s not enough to test it on a single matching string. We need to try to move matching part inside the test string. It’s also important to check performance on a string that does not match, especially on a one that is almost OK, as it can cause most backtracking.

How can I Make my regex fail faster?

If you start with the one that is least likely to match, the regex will fail faster. This one is a bit of a micro-optimization, but it can give you a decent boost depending on your use case, and it can’t possibly hurt because the two expressions are equivalent. I ran a benchmark on the following two regexes:

Are regular expressions really that slow?

You have probably used regular expressions at least once, but have you ever thought about their performance? My experience shows that most of the time developers focus on correctness of a regex, leaving aside its performance. Yet matching a string with a regex can be surprisingly slow.


2 Answers

I assume that the COM call nSearch.InnerHtml is pretty slow and you repeat the call for every single word that you are checking. You can simply cache the result of the call:

public List<string> Keyword_Search(HtmlNode nSearch)
{
    var wordFound = new List<string>();

    // cache inner HTML
    string innerHtml = nSearch.InnerHtml;

    foreach (string currWord in _keywordList)
    {
        bool isMatch = Regex.IsMatch(innerHtml, "\\b" + @currWord + "\\b",
                                              RegexOptions.IgnoreCase);
        if (isMatch)
        {
            wordFound.Add(currWord);
        }
    }
    return wordFound;
}

Another optimization would be the one suggested by Jeff Yates. E.g. by using a single pattern:

string pattern = @"(\b(?:" + string.Join("|", _keywordList) + @")\b)";
like image 118
Dirk Vollmar Avatar answered Sep 30 '22 19:09

Dirk Vollmar


I don't think this is a job for regular expressions. You might be better off searching each message word by word and checking each word against your word list. With the approach you have, you're searching each message n times where n is the number of words you want to find - it's no wonder that it takes a while.

like image 39
Jeff Yates Avatar answered Sep 30 '22 18:09

Jeff Yates