Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex.match in C# performance problem

Tags:

c#

regex

Hi all i use Regex.match in C# to phase a text file line by line. I find it will spend more time(about 2-4 secs) when the line cant match the patten. But spend less time(less than 1 sec) when match. Who can tell me how can i improve the performance?

This is the regex I'm using:

^.*?\t.*?\t(?<npk>\d+)\t(?<bol>\w+)\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+)\t.*?\t.*?\t(?<item>\d{6})\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})\t.*?\t.*?\t.*?\t.*?\t.*?\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t.*?\t(?<material>[\w\-]+)\tIV$
like image 372
user622851 Avatar asked Dec 16 '22 17:12

user622851


2 Answers

Performance problems that only show up when a regex can't match are very often due to catastrophic backtracking. This happens when a regex allows a lot of possible combinations to match the subject text, all of which have to be tried by the regex engine until it may declare failure.

In your case, the reason for failure is obvious:

First, what you're doing shouldn't really be done with a regex, but rather with a CSV parser (or TSV parser in your case).

If you're stuck with regex, though, you still need to change something. Your problem is that the delimiter \t can also be matched by the dot (.), so unless the entire string matches, the regex engine has to try oodles of permutations, like I outlined above.

A big step forward, therefore, would be to change all the .*? into [^\t]* where applicable, and to condense repeats using the {m,n} operator:

^(?:[^\t]*\t){2}(?<npk>\d+)\t(?<bol>\w+)(?:\t[^\t]*){9}\t\s*(?<netValue>[\d\.,]+)(?:\t[^\t]*){2}\t(?<item>\d{6})\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})(?:\t[^\t]*){5}\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t[^\t]*\t(?<material>[\w\-]+)\tIV$

I hope I haven't miscounted :)


Just for illustration:

Matching this text

1   2   3   4   5   6   7   8   9   0

with this excerpt from your regex above

.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+)

takes the regex engine 39 steps.

When you feed it this text, though:

1   2   3   4   5   6   7   8   9   X

it takes the regex engine 4602 steps to figure out that it can't match.

If you use

(?:[^\t]*\t){9}\s*(?<netValue>[\d\.,]+)

instead, the engine needs 30 steps for a successful match and only 39 for a failed attempt.

like image 177
Tim Pietzcker Avatar answered Jan 03 '23 14:01

Tim Pietzcker


Pre-compiling it usually helps:

private static readonly Regex re = new Regex(pattern, RegexOptions.Compiled);

however, I wonder if in this specific case it is tied to the regex - maybe some expensive back-reference. Regex isn't always the tool to use, for example...


Edit now that this appears to be delimited data:

Instead of regex, delimited data can use a parsing approach much more effectively. You might even get away with just var parts=line.Split('\t') (and access parts by index), but if that fails - this csv reader has options to control the delimiters etc.

like image 31
Marc Gravell Avatar answered Jan 03 '23 15:01

Marc Gravell