Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is a cached Regexp outperforming a compiled one?

This is just a question to satisfy my curiosity. But to me it is interesting.

I wrote this little simple benchmark. It calls 3 variants of Regexp execution in a random order a few thousand times:

Basically, I use the same pattern but in different ways.

  1. Your ordinary way without any RegexOptions. Starting with .NET 2.0 these do not get cached. But should be "cached" because it is held in a pretty global scope and not reset.

  2. With RegexOptions.Compiled

  3. With a call to the static Regex.Match(pattern, input) which does get cached in .NET 2.0

Here is the code:

static List<string> Strings = new List<string>();        
static string pattern = ".*_([0-9]+)\\.([^\\.])$";

static Regex Rex = new Regex(pattern);
static Regex RexCompiled = new Regex(pattern, RegexOptions.Compiled);

static Random Rand = new Random(123);

static Stopwatch S1 = new Stopwatch();
static Stopwatch S2 = new Stopwatch();
static Stopwatch S3 = new Stopwatch();

static void Main()
{
  int k = 0;
  int c = 0;
  int c1 = 0;
  int c2 = 0;
  int c3 = 0;

  for (int i = 0; i < 50; i++)
  {
    Strings.Add("file_"  + Rand.Next().ToString() + ".ext");
  }
  int m = 10000;
  for (int j = 0; j < m; j++)
  {
    c = Rand.Next(1, 4);

    if (c == 1)
    {
      c1++;
      k = 0;
      S1.Start();
      foreach (var item in Strings)
      {
        var m1 = Rex.Match(item);
        if (m1.Success) { k++; };
      }
      S1.Stop();
    }
    else if (c == 2)
    {
      c2++;
      k = 0;
      S2.Start();
      foreach (var item in Strings)
      {
        var m2 = RexCompiled.Match(item);
        if (m2.Success) { k++; };
      }
      S2.Stop();
    }
    else if (c == 3)
    {
      c3++;
      k = 0;
      S3.Start();
      foreach (var item in Strings)
      {
        var m3 = Regex.Match(item, pattern);
        if (m3.Success) { k++; };
      }
      S3.Stop();
    }
  }

  Console.WriteLine("c: {0}", c1);
  Console.WriteLine("Total milliseconds: " + (S1.Elapsed.TotalMilliseconds).ToString());
  Console.WriteLine("Adjusted milliseconds: " + (S1.Elapsed.TotalMilliseconds).ToString());

  Console.WriteLine("c: {0}", c2);
  Console.WriteLine("Total milliseconds: " + (S2.Elapsed.TotalMilliseconds).ToString());
  Console.WriteLine("Adjusted milliseconds: " + (S2.Elapsed.TotalMilliseconds*((float)c2/(float)c1)).ToString());

  Console.WriteLine("c: {0}", c3);
  Console.WriteLine("Total milliseconds: " + (S3.Elapsed.TotalMilliseconds).ToString());
  Console.WriteLine("Adjusted milliseconds: " + (S3.Elapsed.TotalMilliseconds*((float)c3/(float)c1)).ToString());
}

Everytime I call it the result is along the lines of:

    Not compiled and not automatically cached:
    Total milliseconds: 6185,2704
    Adjusted milliseconds: 6185,2704

    Compiled and not automatically cached:
    Total milliseconds: 2562,2519
    Adjusted milliseconds: 2551,56949184038

    Not compiled and automatically cached:
    Total milliseconds: 2378,823
    Adjusted milliseconds: 2336,3187176891

So there you have it. Not much, but about 7-8% difference.

It is not the only mystery. I cannot explain why the first way would be that much slower because it is never re-evaluated but held in a global static variable.

By the way, this is on .Net 3.5 and Mono 2.2 which behave exactly the same. On Windows.

So, any ideas, why the compiled variant would even fall behind?

EDIT1:

After fixing the code the results now look like this:

    Not compiled and not automatically cached:
    Total milliseconds: 6456,5711
    Adjusted milliseconds: 6456,5711

    Compiled and not automatically cached:
    Total milliseconds: 2668,9028
    Adjusted milliseconds: 2657,77574842168

    Not compiled and automatically cached:
    Total milliseconds: 6637,5472
    Adjusted milliseconds: 6518,94897724836

Which pretty much obsoletes all of the other questions as well.

Thanks for the answers.

like image 763
user51710 Avatar asked Jan 09 '09 14:01

user51710


People also ask

What is a compiled regex?

Compiled Regular Expressions If a Regex object is constructed with the RegexOptions. Compiled option, it compiles the regular expression to explicit MSIL code instead of high-level regular expression internal instructions. This allows .

Does compiling regex make it faster?

I created a much simpler test that will show you that compiled regular expressions are unquestionably faster than not compiled. Here, the compiled regular expression is 35% faster than the not compiled regular expression.

What does regexp stand for?

Regular Expressions (Regex) Regular Expression, or regex or regexp in short, is extremely and amazingly powerful in searching and manipulating text strings, particularly in processing text files. One line of regex can easily replace several dozen lines of programming codes.

How does regex matching work?

A regex pattern matches a target string. The pattern is composed of a sequence of atoms. An atom is a single point within the regex pattern which it tries to match to the target string. The simplest atom is a literal, but grouping parts of the pattern to match an atom will require using ( ) as metacharacters.


2 Answers

In the Regex.Match version you are looking for the input in the pattern. Try swapping the parameters around.

var m3 = Regex.Match(pattern, item); // Wrong
var m3 = Regex.Match(item, pattern); // Correct
like image 199
Martin Brown Avatar answered Nov 10 '22 13:11

Martin Brown


I noticed similar behavior. I also wondered why the compiled version would be slower, but noticed that above a certain number of calls, the compiled version is faster. So I dug into Reflector a little, and I noticed that for a compiled Regex, there's still a little setup that is performed on first call (specifically, creating an instance of the appropriate RegexRunner object).

In my test, I found that if I moved both the constructor and an initial throw-away call to the regex outside the timer start, the compiled regex won no matter how many iterations I ran.


Incidentally, the caching that the framework is doing when using static Regex methods is an optimization that's only needed when using static Regex methods. This is because every call to a static Regex method creates a new Regex object. In the Regex class's constructor it must parse the pattern. The caching allows subsequent calls of static Regex methods to reuse the RegexTree parsed from the first call, thereby avoiding the parsing step.

When you use instance methods on a single Regex object, then this is not an issue. The parsing is still only performed one time (when you create the object). In addition, you get to avoid running all the other code in the constructor, as well as the heap allocation (and subsequent garbage collection).

Martin Brown noticed that you reversed the arguments to your static Regex call (good catch, Martin). I think you'll find that if you fix that, the instance (not-compiled) regex will beat the static calls every time. You should also find that, given my findings above, the compiled instance will beat the not-compiled one, too.

BUT: You should really read Jeff Atwood's post on compiled regexes before you go blindly applying that option to every regex you create.

like image 34
P Daddy Avatar answered Nov 10 '22 13:11

P Daddy