Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Slow Regex performance

The code below contains a regular expression designed to extract a C# string literal but the performance of the regex matching for input strings of more than a few characters is woeful.

class Program
{
   private static void StringMatch(string s)
    {
        // regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
        Match m = Regex.Match(s, "\"(([^\\\\\"]*)(\\\\.)?)*\"");
        if (m.Success)
            Trace.WriteLine(m.Value);
        else
            Trace.WriteLine("no match");
    }

    public static void Main()
    {
        // this first string is unterminated (so the match fails), but it returns instantly
        StringMatch("\"OK");

        // this string is terminated (the match succeeds)
        StringMatch("\"This is a longer terminated string - it matches and returns instantly\"");

        // this string is unterminated (so the match will fail), but it never returns
        StringMatch("\"This is another unterminated string and takes FOREVER to match");
    }
}

I can refactor the regex into a different form, but can anyone offer an explanation why the performance is so bad?

like image 766
adelphus Avatar asked Mar 13 '12 16:03

adelphus


People also ask

Is regex bad for performance?

My experience shows that most of the time developers focus on correctness of a regex, leaving aside its performance. Yet matching a string with a regex can be surprisingly slow. So slow it can even stop any JS app or take 100% of a server CPU time causing denial of service (DOS).

Is there anything faster than regex?

String operations will always be faster than regular expression operations. Unless, of course, you write the string operations in an inefficient way. Regular expressions have to be parsed, and code generated to perform the operation using string operations.

Does compiling regex make it faster?

Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster. Some users want both startup time and performance; other users want to run on a platform where JITted code is not allowed, and also want performance.

Are regular expression fast?

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow.


2 Answers

You're running into catastrophic backtracking:

Let's simplify the regex a bit (without the escaped quotes and without the second optional group because, as in your comment, it's irrelevant for the tested strings):

"(([^\\"]*))*" 

([^\\"]*) matches any string except quotes or backslashes. This again is enclosed in an optional group that can repeat any number of times.

Now for the string "ABC, the regex engine needs to try the following permutations:

  • ", ABC
  • ", ABC, <empty string>
  • ", AB, C
  • ", AB, C, <empty string>
  • ", AB, <empty string>, C
  • ", AB, <empty string>, C, <empty string>
  • ", <empty string>, AB, C
  • ", <empty string>, AB, C, <empty string>
  • ", <empty string>, AB, <empty string>, C, <empty string>
  • ", <empty string>, AB, <empty string>, C
  • ", A, BC
  • ", A, BC, <empty string>
  • ", A, <empty string>, BC
  • ", <empty string>, A, BC
  • etc.
  • ", A, B, C
  • ", A, B, C, <empty string>
  • ", A, B, <empty string>, C
  • etc. etc.

each of which then fails because there is no following ".

Also, you're only testing for substrings instead of forcing the regex to match the entire string. And you usually want to use verbatim strings for regexes to cut down on the number of backslashes you need. How about this:

foundMatch = Regex.IsMatch(subjectString, 
    @"\A     # Start of the string
    ""       # Match a quote
    (?:      # Either match...
     \\.     # an escaped character
    |        # or
     [^\\""] # any character except backslash or quote
    )*       # any number of times
    ""       # Match a quote
    \Z       # End of the string", 
    RegexOptions.IgnorePatternWhitespace);
like image 129
Tim Pietzcker Avatar answered Sep 17 '22 17:09

Tim Pietzcker


EDIT

Here you go: "\"([^\\\\\"]|\\\\.)*\""

To explain, after C# has escaped the string you get this regex: "([^\\"]|\\.)*"

Meaning:

"                #start with a quote
(
    [^\\"]       #match a non backslash or quote
    |            #or
    \\.          #backslash something
)                
*                #And repeat
"                #end with a quote

By not nesting your * you don't get the exponential or infinite loop, and it returns instantly for me.

like image 20
Buh Buh Avatar answered Sep 16 '22 17:09

Buh Buh