Why input length on regex does not affect performance and how is that possible?
The generated string is like this: 128 random characters. then two numbers inside parenthesis. and this is repeated many times.
128 radnom characters....(-2435346|45436) 128 radnom characters....(-32525562|-325346)
The regex gets all the numbers inside parenthesis. here is the pattern.
\(([-+]?\d+\|[-+]?\d+)\)
So the matches would be like
-2435346|45436
-32525562|-325346
etc...
Here is the code that does the benchmark. I start the stop watch after input is generated because i want to only evaluate Matching time.
Random rand = new Random();
Func<string> getRandString = // generates 128 random characters.
() => Enumerable.Range(0, 128).Select(x => (char) rand.Next()).Aggregate("", (a, b) => a + b);
Func<int> getRandInteger = () => rand.Next(); // generates random number.
string format = "{0}({1}|{2})";
// Generate the big string.
StringBuilder bigstr = new StringBuilder();
for (int i = 0; i < 100; i++) // repeat 100 times.
{
bigstr.Append(string.Format(format, getRandString(), getRandInteger(), getRandInteger()));
}
string input = bigstr.ToString();
Stopwatch stopwatch = Stopwatch.StartNew();
var matches = Regex.Matches(input, @"\(([-+]?\d+\|[-+]?\d+)\)");
stopwatch.Stop();
Console.WriteLine("Time Elapsed :\t{0}\nInputLength :\t{1}\nMatches Count :\t{2}", stopwatch.Elapsed, input.Length, matches.Count);
This is the output in my Console if I repeat loop 10
times.
Time Elapsed : 00:00:00.0004132
InputLength : 1500
Matches Count : 10
If I repeat loop 1000
times.
Time Elapsed : 00:00:00.0004373 // seriously?
InputLength : 149937
Matches Count : 1000
If I repeat loop 1000000
times.
Time Elapsed : 00:00:00.0004900 // wtf?
InputLength : 149964452
Matches Count : 1000000
Screenshot if you dont believe
Is this some kind of lazy evaluation? if so then how it can show the count of matches? how ever I did this under debugger and I could see the matches.
Is there something especial in my regex pattern that makes it fast? but how string length does not affect performance? I cant understand.
One of the reasons this happens is that matches.Count
property is evaluated lazily. When you stop your stopwatch, the matcher is prepared to do matching, but it has not found anything. Only when you call matches.Count
does it start working, but you have stopped the timing by then.
A simple solution is to move matches.Count
call into the portion of the code that you time.
Another reason is that you are including the time it takes to parse regex into account. This is a relatively expensive operation, so you should do it before turning on the stopwatch for better timing:
Regex testRegex = new Regex(@"\(([-+]?\d+\|[-+]?\d+)\)");
Stopwatch stopwatch = Stopwatch.StartNew();
var matches = testRegex.Matches(input);
var matchesCount = matches.Count;
stopwatch.Stop();
Console.WriteLine("Time Elapsed :\t{0}\nInputLength :\t{1}\nMatches Count :\t{2}", stopwatch.Elapsed, input.Length, matchesCount);
Now the growth of time for 10 matches vs. 10,000 matches becomes considerable (00:00:00.0129844
vs. 00:00:00.0843982
), though not as big as one might expect for a thousand-time difference in input length.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With