
C# Regex Performance very slow

Tags: c#, regex

I am very new to regex. I want to parse log files with the following regex:

(?<time>(.*?))[|](?<placeholder4>(.*?))[|](?<source>(.*?))[|](?<level>[1-3])[|](?<message>(.*?))[|][|][|](?<placeholder1>(.*?))[|][|](?<placeholder2>(.*?))[|](?<placeholder3>(.*))

A log line looks like this:

2001.07.13 09:40:20|1|SomeSection|3|====== Some log message::Type: test=sdfsdf|||.\SomeFile.cpp||60|-1

Parsing a log file with approximately 3000 lines takes a very long time. Do you have any hints for speeding it up? Thank you.

Update: I use regex because I parse different log files that do not share the same structure. This is how I use it:

string[] fileContent = File.ReadAllLines(filePath);
Regex pattern = new Regex(LogFormat.GetLineRegex(logFileFormat));

foreach (var line in fileContent)
{
   // Split log line
   Match match = pattern.Match(line);

   string logDate = match.Groups["time"].Value.Trim();
   string logLevel = match.Groups["level"].Value.Trim();
   // And so on...
}

Solution:
Thank you for the help. I've tested the suggestions, with the following results:

1.) Only added RegexOptions.Compiled:
From 00:01:10.9611143 to 00:00:38.8928387

2.) Used Thomas Ayoub's regex:
From 00:00:38.8928387 to 00:00:06.3839097

3.) Used Wiktor Stribiżew's regex:
From 00:00:06.3839097 to 00:00:03.2150095

Asked by opitzh on Sep 13 '16 at 12:09



1 Answer

Let me "convert" my comment into an answer since now I see what you can do about the regex performance.

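As I mentioned in my comment, replace every .*? with [^|]*, and every run of repeated [|][|][|] with [|]{3} (or similar, depending on the number of [|]). A lazy .*? has to check after each character whether the rest of the pattern can match, whereas [^|]* simply consumes everything up to the next pipe. Also, do not use nested capturing groups such as (?<time>(.*?)): the inner group captures the same text a second time, and that also hurts performance!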

var logFileFormat = @"(?<time>[^|]*)[|](?<placeholder4>[^|]*)[|](?<source>[^|]*)[|](?<level>[1-3])[|](?<message>[^|]*)[|]{3}(?<placeholder1>[^|]*)[|]{2}(?<placeholder2>[^|]*)[|](?<placeholder3>.*)";

Only the last .* can remain "wildcardish" since it will grab the rest of the line.
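As a quick check, here is a minimal sketch that applies the pattern above to the sample log line from the question and prints a few of the named groups. The class name LogLineDemo is made up for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

class LogLineDemo
{
    static void Main()
    {
        // Pattern from this answer: negated character classes instead of lazy dots.
        string pattern = @"(?<time>[^|]*)[|](?<placeholder4>[^|]*)[|](?<source>[^|]*)[|](?<level>[1-3])[|](?<message>[^|]*)[|]{3}(?<placeholder1>[^|]*)[|]{2}(?<placeholder2>[^|]*)[|](?<placeholder3>.*)";

        // Sample line from the question.
        string line = @"2001.07.13 09:40:20|1|SomeSection|3|====== Some log message::Type: test=sdfsdf|||.\SomeFile.cpp||60|-1";

        Match match = Regex.Match(line, pattern);
        if (!match.Success)
        {
            Console.WriteLine("Line does not match the expected format.");
            return;
        }

        Console.WriteLine(match.Groups["time"].Value);         // 2001.07.13 09:40:20
        Console.WriteLine(match.Groups["source"].Value);       // SomeSection
        Console.WriteLine(match.Groups["level"].Value);        // 3
        Console.WriteLine(match.Groups["placeholder1"].Value); // .\SomeFile.cpp
        Console.WriteLine(match.Groups["placeholder3"].Value); // -1
    }
}
```

Note that [^|]* only works for the message group as long as the message itself never contains a pipe character, which holds for the sample line.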

Here is a comparison of your and my regex patterns at RegexHero.

[Image: RegexHero comparison of the two patterns]
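If you would rather reproduce such a comparison locally, a rough sketch with Stopwatch could look like the following. The helper name Time and the iteration count are assumptions for illustration, and the timings will vary by machine and .NET version, so treat the result as a relative indication only.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class PatternTiming
{
    // Hypothetical helper: measures how long a pattern takes to match one line repeatedly.
    static TimeSpan Time(string pattern, string line, int iterations)
    {
        var regex = new Regex(pattern, RegexOptions.Compiled);
        regex.Match(line); // warm-up, so one-time setup cost is not part of the measurement

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.Match(line);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    static void Main()
    {
        // Original pattern from the question (lazy dots, nested groups).
        string lazy = @"(?<time>(.*?))[|](?<placeholder4>(.*?))[|](?<source>(.*?))[|](?<level>[1-3])[|](?<message>(.*?))[|][|][|](?<placeholder1>(.*?))[|][|](?<placeholder2>(.*?))[|](?<placeholder3>(.*))";

        // Rewritten pattern from this answer (negated character classes).
        string negated = @"(?<time>[^|]*)[|](?<placeholder4>[^|]*)[|](?<source>[^|]*)[|](?<level>[1-3])[|](?<message>[^|]*)[|]{3}(?<placeholder1>[^|]*)[|]{2}(?<placeholder2>[^|]*)[|](?<placeholder3>.*)";

        string line = @"2001.07.13 09:40:20|1|SomeSection|3|====== Some log message::Type: test=sdfsdf|||.\SomeFile.cpp||60|-1";

        const int iterations = 100000; // arbitrary; large enough to get a measurable time

        Console.WriteLine("lazy:    " + Time(lazy, line, iterations));
        Console.WriteLine("negated: " + Time(negated, line, iterations));
    }
}
```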

Then, use RegexOptions.Compiled:

Regex pattern = new Regex(LogFormat.GetLineRegex(logFileFormat), RegexOptions.Compiled);
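RegexOptions.Compiled makes constructing the Regex more expensive, so it pays off only when the same instance is reused for many lines. One way to arrange that, sketched here with a hypothetical LogLineParser class and TryParse helper that are not part of the original code, is to keep the compiled regex in a static field:

```csharp
using System;
using System.Text.RegularExpressions;

static class LogLineParser
{
    // Built once and reused for every line, so the compilation cost is paid a single time.
    private static readonly Regex LinePattern = new Regex(
        @"(?<time>[^|]*)[|](?<placeholder4>[^|]*)[|](?<source>[^|]*)[|](?<level>[1-3])[|](?<message>[^|]*)[|]{3}(?<placeholder1>[^|]*)[|]{2}(?<placeholder2>[^|]*)[|](?<placeholder3>.*)",
        RegexOptions.Compiled);

    // Hypothetical helper: returns false for lines that do not match the format.
    public static bool TryParse(string line, out string time, out string level)
    {
        Match match = LinePattern.Match(line);
        time = match.Success ? match.Groups["time"].Value.Trim() : string.Empty;
        level = match.Success ? match.Groups["level"].Value.Trim() : string.Empty;
        return match.Success;
    }
}

class Program
{
    static void Main()
    {
        string[] lines =
        {
            @"2001.07.13 09:40:20|1|SomeSection|3|====== Some log message::Type: test=sdfsdf|||.\SomeFile.cpp||60|-1",
            "not a log line"
        };

        foreach (string line in lines)
        {
            if (LogLineParser.TryParse(line, out string time, out string level))
            {
                Console.WriteLine(time + " level=" + level);
            }
            else
            {
                Console.WriteLine("skipped: " + line);
            }
        }
    }
}
```

Checking match.Success before reading the groups also keeps malformed lines from silently producing empty values.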
Answered by Wiktor Stribiżew on Oct 01 '22 at 20:10