 

When should I use a compiled Regex vs. interpreted?

Tags: .net, regex

After reading this article http://www.codinghorror.com/blog/archives/000228.html I understand the benefits of compiled regular expressions a little better. However, in what scenarios, in your experience, is a compiled regex actually warranted?

For instance, I am using a regex in a loop, and the regular expression string uses different variables on each iteration, so I would see no improvement from flagging this regex as compiled, right?


Hi, thanks for your answers. My actual code is not straightforward and is composed of an RE built on the fly, so I cannot include it. For all intents and purposes, here is an example that demonstrates my approach:
foreach (var field in fields.Where(x => x.condition))
    MatchResults = Regex.Match(request.Message, field.RegularExpression); // the pattern differs for each field
...
GONeale asked Jan 06 '09 04:01



2 Answers

In .NET, there are two ways to "compile" a regular expression. Regular expressions are always "compiled" before they can be used to find matches. When you instantiate the Regex class without the RegexOptions.Compiled flag, your regular expression is still converted into an internal data structure used by the Regex class. The actual matching process runs on that data structure rather than on the string representing your regex. The data structure persists for as long as your Regex instance lives.

Explicitly instantiating the Regex class is preferable to calling the static Regex methods if you're using the same regex more than once. The reason is that the static methods create a Regex instance anyway, and then throw it away. They do keep a cache of recently compiled regexes, but the cache is rather small, and a cache lookup is far more costly than simply holding a reference to an existing Regex instance.
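
As a minimal sketch of the difference described above (the class name, method names, and the \d+ pattern are made up for illustration and do not come from the question), the first method goes through the static cache on every call, while the second reuses one Regex instance constructed up front:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexReuseSketch
{
    // Constructed once. The parsed internal representation lives as long as
    // this instance does, so every later call skips parsing and cache lookup.
    private static readonly Regex Digits = new Regex(@"\d+");

    // Static call: each invocation looks the pattern up in the framework's
    // internal cache (its size is exposed as Regex.CacheSize).
    public static int CountWithStaticCall(string[] lines)
    {
        int count = 0;
        foreach (string line in lines)
        {
            if (Regex.IsMatch(line, @"\d+")) count++;
        }
        return count;
    }

    // Instance call: reuses the Regex object held in the field above.
    public static int CountWithInstance(string[] lines)
    {
        int count = 0;
        foreach (string line in lines)
        {
            if (Digits.IsMatch(line)) count++;
        }
        return count;
    }
}
```

Both methods return the same count. Only the per-call overhead differs, and whether that overhead matters depends on how often the pattern is used.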

The above form of compilation exists in every programming language or library that uses regular expressions, though not all offer control over it.

The .NET framework provides a second way of compiling regular expressions by constructing a Regex object and specifying the RegexOptions.Compiled flag. Absence or presence of this flag does not indicate whether or not the regex is compiled. It indicates whether the regex is compiled quickly, as described above, or thoroughly, as described below.

What RegexOptions.Compiled really does is to create a new assembly with your regular expression compiled down to MSIL. This assembly is then loaded, compiled to machine code, and becomes a permanent part of your application (while it runs). This process takes a lot of CPU ticks, and the memory usage is permanent.

You should use RegexOptions.Compiled only if you're processing so much data with it that the user actually has to wait on your regex. If you can't measure the speed difference with a stopwatch, don't bother with RegexOptions.Compiled.
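
One rough way to apply the "stopwatch" test is sketched below. The class and method names are hypothetical. The timing deliberately includes construction of the Regex, since that is where RegexOptions.Compiled pays its up-front cost. A single run like this is only indicative, because JIT warm-up and machine load affect the numbers.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class CompiledTimingSketch
{
    // Times constructing the regex with the given options and running it
    // over every input. The number of matching inputs is returned through
    // 'matches' so the two runs can be checked against each other.
    public static TimeSpan Measure(string pattern, RegexOptions options,
                                   string[] inputs, out int matches)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Regex regex = new Regex(pattern, options); // construction cost is included on purpose
        matches = 0;
        foreach (string input in inputs)
        {
            if (regex.IsMatch(input)) matches++;
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Calling Measure once with RegexOptions.None and once with RegexOptions.Compiled on your real pattern and a representative data set gives two elapsed times to compare. If the difference is not noticeable at the volume you actually process, the advice above is to skip the flag.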

Jan Goyvaerts answered Sep 20 '22 12:09



I would compile the RE when it has to be used more than two or three times and the cost of compiling is more than offset by the improvements in execution time of the result.

I never compile one-off REs and I always compile those that are executed more than five times (give or take a couple) but I've never found a need for parameterized REs (that need may exist, it's just I've never found it) so that doesn't come into it.

EDIT: The article you refer to states that up-front compiling is an order of magnitude (ten times) slower than interpretation, yet saves only about 30% at execution time. In addition, interpreted REs are cached anyway. So I would say it's definitely arguing against the casual use of compiling.

If the compilation overhead is taken as roughly ten executions' worth of time and each compiled execution saves 30%, it would take 10 / 0.3 = 100/3 (about 33) executions of the compiled RE to recover the initial cost of compilation. That's according to the MSDN documentation on .NET. I've always assumed it wouldn't be that bad for my REs (Python/Perl/Java), but I guess I should check.
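
The break-even estimate above can be written out as a small calculation. This is only a sketch of the arithmetic: the class and method names are hypothetical, and it assumes the one-time compilation overhead can be expressed in units of "one interpreted execution" (ten such units here), which is a simplification of what the article measured.

```csharp
using System;

public static class BreakEvenSketch
{
    // extraCompileCost: one-time overhead of compiling, measured in units of
    //                   "one interpreted execution" (an assumed simplification).
    // savingFraction:   fraction of execution time saved by each compiled run,
    //                   e.g. 0.30 for a 30% saving.
    // Returns the smallest whole number of executions after which the
    // accumulated savings cover the one-time overhead.
    public static int ExecutionsToBreakEven(double extraCompileCost, double savingFraction)
    {
        if (savingFraction <= 0.0)
        {
            throw new ArgumentOutOfRangeException(nameof(savingFraction));
        }
        return (int)Math.Ceiling(extraCompileCost / savingFraction);
    }
}
```

With an overhead of 10 and a saving of 0.30, ExecutionsToBreakEven returns 34, which is 100/3 rounded up to a whole execution. A regex that runs only a handful of times never gets there.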

paxdiablo answered Sep 19 '22 12:09
