Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why are C# compiled regular expressions faster than equivalent string methods?

Every time I have to do simple containment or replacement operations on strings, where the term that I'm searching for is a fixed value, I find that if I take my sample input and do some profiling on it, using a compiled regular expression is nearly* always faster than using the equivalent method from the String class.

I've tried comparing a variety of methods ( hs is the "haystack" to search, ndl is the "needle" to search for, repl is the replacement value. regex is always created with the RegexOptions.Compiled option ):

  • hs.Replace( ndl, repl ) vs regex.Replace( hs, repl )
  • hs.Contains( ndl ) vs regex.IsMatch( hs )

I've found quite a few discussions focusing on which of the two techniques are faster (1, 2, 3, and loads of others), but those discussions always seem to focus on:

  1. Use the string version for simple operations and regex for complex operations (which, from a raw performance perspective, doesn't even seem to be necessarily a good idea), or
  2. Run a test and compare the two ( and for equivalent tests, the regex version seems to always perform better ).

I don't understand how this can possibly be the case: how does the regex engine compare any two strings for substring matches faster than the equivalent string version? This seems to hold true for search spaces that are very small or very large, or search terms that are small or large, or whether the search term occurs early or late in the search space.

So, why are regular expressions faster?


* In fact, the only case I've managed to show that the string version is faster than a compiled regex is when searching an empty string! Any other case, from single character strings to very long strings are processed faster by a compiled regex than the equivalent string method.


Update: Added a clause to clarify that I'm looking at cases where the search term is known at compile time. For dynamic or one-time operations, the overhead of compiling the regular expression will tend to skew the results in favor of the string methods.

like image 927
Chris Phillips Avatar asked Sep 14 '12 16:09

Chris Phillips


People also ask

Why is C not A or B?

Because C comes after B The reason why the language was named “C” by its creator was that it came after B language. Back then, Bell Labs already had a programming language called “B” at their disposal.

Why do we use C?

C is a general-purpose programming language and can efficiently work on enterprise applications, games, graphics, and applications requiring calculations, etc. C language has a rich library which provides a number of built-in functions. It also offers dynamic memory allocation.

Why are we still using C?

The C programming language doesn't seem to have an expiration date. It's closeness to the hardware, great portability and deterministic usage of resources makes it ideal for low level development for such things as operating system kernels and embedded software.

Why C is the best language?

The programs that you write in C compile and execute much faster than those written in other languages. This is because it does not have garbage collection and other such additional processing overheads. Hence, the language is faster as compared to most other programming languages.


2 Answers

I don't understand how this can possibly be the case: how does the regex engine compare any two strings for substring matches faster than the equivalent string version?

I can think of two reasons:

  1. The regex is using some smart algorithm like Boyer Moore (O(M/N)) while the simple string operation simply compares the needle to each position in the haystack (O(N*M)).
  2. They're not really doing the same thing. For example, one might do culture-invariant matching while the other does culture-dependent matching, which might make a performance difference.
like image 151
Niki Avatar answered Oct 18 '22 11:10

Niki


As the Base Class Library team wrote:

In [the case of RegexOptions.Compiled], we first do the work to parse into opcodes. Then we also do more work to turn those opcodes into actual IL using Reflection.Emit. As you can imagine, this mode trades increased startup time for quicker runtime: in practice, compilation takes about an order of magnitude longer to startup, but yields 30% better runtime performance.

But, you're overloking one important thing: Pattern is fixed. Be aware that this isn't always the case. You can't change it at runtime! There will be cases in which flexibility will go down for more than the 30% of the performance gain.

like image 44
Erre Efe Avatar answered Oct 18 '22 13:10

Erre Efe