
C# Regex performance is poor relative to JS

I have had good experiences with the speed of regex in JS, so I decided to run a small comparison. I ran the following code:

var str = "A regular expression is a pattern that the regular expression engine attempts to match in input text.";

var re = new RegExp("t", "g");

console.time();

for(var i = 0; i < 10e6; i++)
   str.replace(re, "1");

console.timeEnd();

The result: 3888.731ms.

Now in C#:

var stopwatch = new Stopwatch();

var str = "A regular expression is a pattern that the regular expression engine attempts to match in input text.";

var re = new Regex("t", RegexOptions.Compiled);

stopwatch.Start();

for (int i = 0; i < 10e6; i++)
    re.Replace(str, "1");

stopwatch.Stop();

Console.WriteLine( stopwatch.Elapsed.TotalMilliseconds);

Result: 32798.8756ms !!

Next, I compared re.exec(str); with Regex.Match(str, "t");: 1205.791ms vs 7352.532ms, in favor of JS.
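One caveat when timing re.exec in a loop: a regex created with the "g" flag is stateful in JavaScript. Each exec call resumes from re.lastIndex and returns null once the matches run out, so the loop walks through successive matches instead of finding the same first match every time, which is what the static Regex.Match(str, "t") does. A minimal sketch of that behavior (the helper name countExecCalls and the short sample string are made up for illustration):

```javascript
// Hypothetical helper: counts how many exec() calls it takes to cycle
// through all matches once, including the final call that returns null.
function countExecCalls(re, str) {
  re.lastIndex = 0;
  let calls = 0;
  let m;
  do {
    m = re.exec(str);
    calls++;
  } while (m !== null);
  return calls;
}

const text = "text";
const re = new RegExp("t", "g");

const first = re.exec(text);   // matches at index 0
const second = re.exec(text);  // resumes after the first match: index 3
const third = re.exec(text);   // no more matches: null, lastIndex resets to 0

console.log(first.index, second.index, third);
console.log(countExecCalls(re, text));
```

Dropping the "g" flag on the JS side makes every exec call start from index 0, which is probably closer to what the C# call measures.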

Is .NET "not suitable" for massive text processing?

UPDATE 1: the same test with the pattern [ta] (instead of the literal t):

3336.063ms in JS vs 64534.4766ms in C#!

Another example:

console.time();

var str = "A regular expression is a pattern that the regular expression engine attempts 123 to match in input text.";


var re = new RegExp("\\d+", "g");
var result;
for(var i = 0; i < 10e6; i++)
    result = str.replace(re, "$&"); // "$&" is the whole match in JS (.NET spells it "$0")
   

console.timeEnd();

3350.230ms in JS vs 32582.405ms in C#.
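A note on the replacement string when porting this test between the two languages: .NET treats $0 as the whole match, while JavaScript uses $& for that and inserts $0 literally. Also, passing a string (rather than a RegExp) as the first argument to replace does a literal search with no regex involved. A small sketch of the three cases (the variable names and the shortened sample string are illustrative):

```javascript
const sample = "attempts 123 to match";
const digits = /\d+/g;

// "$&" re-inserts the matched text, so the string comes back unchanged.
const withAmp = sample.replace(digits, "$&");

// "$0" is not a replacement token in JavaScript; it is inserted literally.
const withZero = sample.replace(digits, "$0");

// A string pattern is matched literally (first occurrence only); no regex runs.
const withStringPattern = sample.replace(sample, "$0");

console.log(withAmp);
console.log(withZero);
console.log(withStringPattern);
```

If the two sides of a benchmark do not produce the same output, the timings are not measuring the same work.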

dovid asked Dec 16 '17 18:12



2 Answers

String in C# is a dangerous beast, and you really can shoot yourself in the foot if you use it carelessly, but I don't think the given test is representative enough to warrant any generalizations.

First, I did reproduce similar performance for your test case. Adding RegexOptions.Compiled reduced the required time to 30-ish seconds, but this is still a significant difference.

The specific test case is probably not very realistic: who would use a regex for a single-char replace? With the dedicated API for this task you get comparable results: str.Replace('t', '1'); took 1600ms on my machine.

This means that for this specific task, C# performance is comparable to JS. Whether C#'s Regex.Replace() is internally ill-suited to single-char replacements, or whether the JS engine is optimizing the regex away, is something a JS guru should answer.

It would be interesting to know whether a more realistic, complex regex shows a notable difference.
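The same regex-versus-dedicated-API comparison can be tried on the JS side. The sketch below only shows that both routes give the same output and times them roughly; the timeIt helper and the iteration count are made up for illustration, and it makes no claim about which one is faster on a given engine:

```javascript
const str = "A regular expression is a pattern that the regular expression engine attempts to match in input text.";

// Regex-based replacement, as in the question.
const viaRegex = str.replace(/t/g, "1");

// Plain string operations; no regex engine involved.
const viaSplitJoin = str.split("t").join("1");

// Hypothetical timing helper; the iteration count is arbitrary.
function timeIt(label, fn, iterations = 1e5) {
  const start = Date.now();
  let last;
  for (let i = 0; i < iterations; i++) last = fn();
  console.log(label, Date.now() - start, "ms");
  return last;
}

timeIt("regex", () => str.replace(/t/g, "1"));
timeIt("split/join", () => str.split("t").join("1"));
```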

Edit: I verified that the performance gap remains when the replace results are actually used and when the input strings differ in each run (10s vs 35s in my tests). So the gap is smaller, but still there.
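For reference, a check of this kind on the JS side could look roughly like the sketch below. It is not the code behind the numbers above: the helper names, the input strings, and the input count are all assumptions chosen for illustration. Each input is different, and every result feeds a running total so the work cannot simply be discarded.

```javascript
// Builds distinct inputs so no two iterations see the same string.
function makeInputs(count) {
  const inputs = [];
  for (let i = 0; i < count; i++) {
    inputs.push("attempt " + i + " to match in input text");
  }
  return inputs;
}

// Runs the replacement over every input and accumulates totals,
// so the results are observably used.
function runBenchmark(inputs, re) {
  let totalLength = 0;
  let replaced = 0;
  for (const input of inputs) {
    const out = input.replace(re, "1");
    totalLength += out.length;
    if (out !== input) replaced++;
  }
  return { totalLength, replaced };
}

const inputs = makeInputs(1000);
const start = Date.now();
const result = runBenchmark(inputs, /t/g);
console.log(result, Date.now() - start, "ms");
```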

Possible reasons

According to hints from this SO question, browser implementations delegate some string operations to optimized C++ code. If they do this for string concatenation, they probably do it for regex as well. AFAIK, the C# Regex and String classes stay in the managed world, and that brings some baggage.

Imre Pühvel answered Nov 12 '22 14:11



One of the reasons for the big difference between JS regex and .NET regex is that JS lacks quite a number of advanced features, whereas .NET is very feature-rich.

Here are two quotes from regular-expressions.info:

JavaScript:

JavaScript implements Perl-style regular expressions. However, it lacks quite a number of advanced features available in Perl and other modern regular expression flavors:

No \A or \Z anchors to match the start or end of the string. Use a caret or dollar instead.

No atomic grouping or possessive quantifiers.

No Unicode support, except for matching single characters with \uFFFF.

No named capturing groups. Use numbered capturing groups instead.

No mode modifiers to set matching options within the regular expression.

No conditionals.

No regular expression comments. Describe your regular expression with JavaScript // comments instead, outside the regular expression string.

.NET Framework:

The Microsoft .NET Framework, which you can use with any .NET programming language such as C# (C sharp) or Visual Basic.NET, has solid support for regular expressions. .NET's regex flavor is very feature-rich. The only noteworthy feature that's lacking are possessive quantifiers.

codeDom answered Nov 12 '22 13:11
