Multithreaded use of Regex

Given the following from MSDN:

Regex objects can be created on any thread and shared between threads.

I have found that, for performance, it is better NOT to share a Regex instance between threads, and instead to give each thread its own instance via the ThreadLocal class.

Please could someone explain why it runs approximately 5 times faster for a thread local instance?

Here are the results (on an 8 core machine):

   'Using Regex singleton' returns 3000000 and takes 00:00:01.1005695
   'Using thread local Regex' returns 3000000 and takes 00:00:00.2243880

Source Code:

using System;
using System.Linq;
using System.Threading;
using System.Text.RegularExpressions;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static readonly string str = new string('a', 400);
        static readonly Regex re = new Regex("(a{200})(a{200})", RegexOptions.Compiled);

        // Runs one million matches in parallel, obtaining the Regex through the supplied delegate,
        // then prints the summed group count together with the elapsed time.
        static void Test(Func<Regex> regexGettingMethod, string methodDescription)
        {
            Stopwatch sw = Stopwatch.StartNew();
            var sum = Enumerable.Repeat(str, 1000000).AsParallel().Select(s => regexGettingMethod().Match(s).Groups.Count).Sum();
            sw.Stop();
            Console.WriteLine("'{0}' returns {1} and takes {2}", methodDescription, sum, sw.Elapsed);
        }

        static void Main(string[] args)
        {
            Test(() => re, "Using Regex singleton");

            var threadLocalRe = new ThreadLocal<Regex>(() => new Regex(re.ToString(), RegexOptions.Compiled));
            Test(() => threadLocalRe.Value, "Using thread local Regex");

            Console.Write("Press any key");
            Console.ReadKey();
        }
    }
}
SergeyS asked Sep 28 '11 14:09



1 Answer

Posting my investigation results.

Let's look at Regex in ILSpy. It holds a reference to a RegexRunner. When a Regex object is matching something, it locks its RegexRunner. If another concurrent request arrives for the same Regex object, a temporary RegexRunner instance gets created. Creating a RegexRunner is expensive. The more threads share a Regex object, the greater the chance of wasting time creating temporary RegexRunners. I hope Microsoft will fix this for the era of massive parallelism.
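The behaviour described above can be modelled with a small sketch. This is not the framework's code: Runner and Matcher are hypothetical stand-ins for RegexRunner and Regex, and the counter exists only to make the extra allocations visible. It assumes the cached runner is handed out exclusively, so a second concurrent caller finds nothing cached and has to build its own.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for RegexRunner: assumed to be expensive to construct.
sealed class Runner
{
    public static int Created;   // counts constructions, for illustration only
    public Runner() { Interlocked.Increment(ref Created); }
    public int Scan(string input) => input.Length;   // placeholder for real matching work
}

// Hypothetical stand-in for Regex: caches at most one idle Runner.
sealed class Matcher
{
    private Runner _cached = new Runner();

    public int Match(string input)
    {
        // Take the cached runner exclusively. A concurrent caller sees null
        // here and pays for a temporary Runner instead.
        Runner runner = Interlocked.Exchange(ref _cached, null) ?? new Runner();
        try { return runner.Scan(input); }
        finally { Volatile.Write(ref _cached, runner); }   // hand a runner back for the next caller
    }
}

static class Demo
{
    static void Main()
    {
        var m = new Matcher();
        for (int i = 0; i < 1000; i++) m.Match("aaaa");
        // Single-threaded, the one cached runner is always available.
        Console.WriteLine(Runner.Created);   // prints 1
    }
}
```

If Match were called from several threads at once, Runner.Created would grow with the amount of overlap, which is the cost the thread-local variant in the question avoids.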

Another thing: the static members of the Regex class that take the pattern string as a parameter (like Regex.IsMatch(input, pattern)) must also perform badly when the same pattern is matched on different threads. Regex maintains a cache of RegexRunners. Two concurrent Regex.IsMatch() calls with the same pattern will try to use the same RegexRunner, and one thread will have to create a temporary RegexRunner.
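If that diagnosis applies to your runtime, one possible way to sidestep the static methods is to wrap the pattern in a per-thread Regex. The sketch below is only an illustration: PerThreadRegex is a hypothetical helper, not a framework type, and it simply applies the ThreadLocal approach from the question to IsMatch.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading;

// Hypothetical helper (not part of the framework): one Regex per thread for a fixed pattern.
sealed class PerThreadRegex : IDisposable
{
    private readonly ThreadLocal<Regex> _local;

    public PerThreadRegex(string pattern, RegexOptions options = RegexOptions.None)
    {
        // The factory runs once per thread, on first access to Value.
        _local = new ThreadLocal<Regex>(() => new Regex(pattern, options));
    }

    public bool IsMatch(string input) => _local.Value.IsMatch(input);

    public void Dispose() => _local.Dispose();
}

static class PerThreadDemo
{
    static void Main()
    {
        using (var re = new PerThreadRegex("^a{3}$"))
        {
            Console.WriteLine(re.IsMatch("aaa"));    // True
            Console.WriteLine(re.IsMatch("aaaa"));   // False
        }
    }
}
```

The trade-off is one Regex construction per thread, so this only pays off when each thread performs many matches.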

Thanks, Will, for letting me know how questions are handled here when the topic starter has found the answer.

SergeyS answered Oct 30 '22 02:10
