The code below contains a regular expression designed to extract a C# string literal but the performance of the regex matching for input strings of more than a few characters is woeful.
class Program
{
private static void StringMatch(string s)
{
// regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
Match m = Regex.Match(s, "\"(([^\\\\\"]*)(\\\\.)?)*\"");
if (m.Success)
Trace.WriteLine(m.Value);
else
Trace.WriteLine("no match");
}
public static void Main()
{
// this first string is unterminated (so the match fails), but it returns instantly
StringMatch("\"OK");
// this string is terminated (the match succeeds)
StringMatch("\"This is a longer terminated string - it matches and returns instantly\"");
// this string is unterminated (so the match will fail), but it never returns
StringMatch("\"This is another unterminated string and takes FOREVER to match");
}
}
I can refactor the regex into a different form, but can anyone offer an explanation why the performance is so bad?
My experience shows that most of the time developers focus on correctness of a regex, leaving aside its performance. Yet matching a string with a regex can be surprisingly slow. So slow it can even stop any JS app or take 100% of a server CPU time causing denial of service (DOS).
String operations will always be faster than regular expression operations. Unless, of course, you write the string operations in an inefficient way. Regular expressions have to be parsed, and code generated to perform the operation using string operations.
Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster. Some users want both startup time and performance; other users want to run on a platform where JITted code is not allowed, and also want performance.
Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow.
You're running into catastrophic backtracking:
Let's simplify the regex a bit (without the escaped quotes and without the second optional group because, as in your comment, it's irrelevant for the tested strings):
"(([^\\"]*))*"
([^\\"]*)
matches any string except quotes or backslashes. This again is enclosed in an optional group that can repeat any number of times.
Now for the string "ABC
, the regex engine needs to try the following permutations:
"
, ABC
"
, ABC
, <empty string>
"
, AB
, C
"
, AB
, C
, <empty string>
"
, AB
, <empty string>
, C
"
, AB
, <empty string>
, C
, <empty string>
"
, <empty string>
, AB
, C
"
, <empty string>
, AB
, C
, <empty string>
"
, <empty string>
, AB
, <empty string>
, C
, <empty string>
"
, <empty string>
, AB
, <empty string>
, C
"
, A
, BC
"
, A
, BC
, <empty string>
"
, A
, <empty string>
, BC
"
, <empty string>
, A
, BC
"
, A
, B
, C
"
, A
, B
, C
, <empty string>
"
, A
, B
, <empty string>
, C
each of which then fails because there is no following "
.
Also, you're only testing for substrings instead of forcing the regex to match the entire string. And you usually want to use verbatim strings for regexes to cut down on the number of backslashes you need. How about this:
foundMatch = Regex.IsMatch(subjectString,
@"\A # Start of the string
"" # Match a quote
(?: # Either match...
\\. # an escaped character
| # or
[^\\""] # any character except backslash or quote
)* # any number of times
"" # Match a quote
\Z # End of the string",
RegexOptions.IgnorePatternWhitespace);
EDIT
Here you go: "\"([^\\\\\"]|\\\\.)*\""
To explain, after C# has escaped the string you get this regex:
"([^\\"]|\\.)*"
Meaning:
" #start with a quote
(
[^\\"] #match a non backslash or quote
| #or
\\. #backslash something
)
* #And repeat
" #end with a quote
By not nesting your * you don't get the exponential or infinite loop, and it returns instantly for me.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With