Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why regex does not care about string length

Why input length on regex does not affect performance and how is that possible?

The generated string is like this: 128 random characters. then two numbers inside parenthesis. and this is repeated many times.

128 radnom characters....(-2435346|45436) 128 radnom characters....(-32525562|-325346)

The regex gets all the numbers inside parenthesis. here is the pattern.

\(([-+]?\d+\|[-+]?\d+)\)

So the matches would be like

-2435346|45436
-32525562|-325346
etc...

Here is the code that does the benchmark. I start the stop watch after input is generated because i want to only evaluate Matching time.

Random rand = new Random();
Func<string> getRandString = // generates 128 random characters.
    () => Enumerable.Range(0, 128).Select(x => (char) rand.Next()).Aggregate("", (a, b) => a + b);
Func<int> getRandInteger = () => rand.Next(); // generates random number.
string format = "{0}({1}|{2})";

// Generate the big string.
StringBuilder bigstr = new StringBuilder();
for (int i = 0; i < 100; i++) // repeat 100 times.
{
    bigstr.Append(string.Format(format, getRandString(), getRandInteger(), getRandInteger()));
}
string input = bigstr.ToString();

Stopwatch stopwatch = Stopwatch.StartNew();
var matches = Regex.Matches(input, @"\(([-+]?\d+\|[-+]?\d+)\)");
stopwatch.Stop();
Console.WriteLine("Time Elapsed :\t{0}\nInputLength :\t{1}\nMatches Count :\t{2}", stopwatch.Elapsed, input.Length, matches.Count);

This is the output in my Console if I repeat loop 10 times.

Time Elapsed :   00:00:00.0004132
InputLength :    1500
Matches Count :  10

If I repeat loop 1000 times.

Time Elapsed :   00:00:00.0004373 // seriously?
InputLength :    149937
Matches Count :  1000

If I repeat loop 1000000 times.

Time Elapsed :   00:00:00.0004900 // wtf?
InputLength :    149964452
Matches Count :  1000000

Screenshot if you dont believe

enter image description here

Is this some kind of lazy evaluation? if so then how it can show the count of matches? how ever I did this under debugger and I could see the matches.

Is there something especial in my regex pattern that makes it fast? but how string length does not affect performance? I cant understand.

like image 533
M.kazem Akhgary Avatar asked Nov 08 '15 11:11

M.kazem Akhgary


1 Answers

One of the reasons this happens is that matches.Count property is evaluated lazily. When you stop your stopwatch, the matcher is prepared to do matching, but it has not found anything. Only when you call matches.Count does it start working, but you have stopped the timing by then.

A simple solution is to move matches.Count call into the portion of the code that you time.

Another reason is that you are including the time it takes to parse regex into account. This is a relatively expensive operation, so you should do it before turning on the stopwatch for better timing:

Regex testRegex = new Regex(@"\(([-+]?\d+\|[-+]?\d+)\)");
Stopwatch stopwatch = Stopwatch.StartNew();
var matches = testRegex.Matches(input);
var matchesCount = matches.Count;
stopwatch.Stop();
Console.WriteLine("Time Elapsed :\t{0}\nInputLength :\t{1}\nMatches Count :\t{2}", stopwatch.Elapsed, input.Length, matchesCount);

Now the growth of time for 10 matches vs. 10,000 matches becomes considerable (00:00:00.0129844 vs. 00:00:00.0843982), though not as big as one might expect for a thousand-time difference in input length.

like image 131
Sergey Kalinichenko Avatar answered Sep 21 '22 02:09

Sergey Kalinichenko