
\d less efficient than [0-9]

I made a comment yesterday on an answer where someone had used [0123456789] in a regex rather than [0-9] or \d. I said it was probably more efficient to use a range or digit specifier than a character set.

I decided to test that today and found, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which don't seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters, 5077 of which actually contain a digit:

    Regex \d           took 00:00:00.2141226 result: 5077/10000
    Regex [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
    Regex [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first

This surprises me for two reasons, and I would be interested if anyone can shed some light on them:

  1. I would have thought the range would be implemented much more efficiently than the set.
  2. I can't understand why \d is worse than [0-9]. Is there more to \d than simply shorthand for [0-9]?

Here is the test code:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    namespace SO_RegexPerformance
    {
        class Program
        {
            static void Main(string[] args)
            {
                var rand = new Random(1234);
                var strings = new List<string>();
                //10K random strings
                for (var i = 0; i < 10000; i++)
                {
                    //generate random string
                    var sb = new StringBuilder();
                    for (var c = 0; c < 1000; c++)
                    {
                        //add a-z randomly
                        sb.Append((char)('a' + rand.Next(26)));
                    }
                    //in roughly 50% of them, put a digit
                    if (rand.Next(2) == 0)
                    {
                        //replace 1 char with a digit 0-9
                        sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                    }
                    strings.Add(sb.ToString());
                }

                var baseTime = testPerfomance(strings, @"\d");
                Console.WriteLine();
                var testTime = testPerfomance(strings, "[0-9]");
                Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
                testTime = testPerfomance(strings, "[0123456789]");
                Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            }

            private static TimeSpan testPerfomance(List<string> strings, string regex)
            {
                var sw = new Stopwatch();

                int successes = 0;

                var rex = new Regex(regex);

                sw.Start();
                foreach (var str in strings)
                {
                    if (rex.Match(str).Success)
                    {
                        successes++;
                    }
                }
                sw.Stop();

                Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

                return sw.Elapsed;
            }
        }
    }
weston asked May 18 '13 07:05


2 Answers

\d matches every Unicode decimal digit, while [0-9] is limited to these 10 ASCII characters. For example, the Persian digits ۱۲۳۴۵۶۷۸۹ are Unicode digits that are matched by \d but not by [0-9].

You can generate a list of all such characters using the following code:

    var sb = new StringBuilder();
    for (UInt16 i = 0; i < UInt16.MaxValue; i++)
    {
        string str = Convert.ToChar(i).ToString();
        if (Regex.IsMatch(str, @"\d"))
            sb.Append(str);
    }
    Console.WriteLine(sb.ToString());

Which generates:

0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙０１２３４５６７８９
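To check a single string rather than enumerate every character, a minimal sketch along these lines may help. The class and helper names (DigitClassDemo, AllMatch) are made up for illustration and are not part of the original answer. In .NET, \d is documented as equivalent to \p{Nd}, so the last call is expected to agree with the \d result.

```csharp
using System;
using System.Text.RegularExpressions;

static class DigitClassDemo
{
    // Hypothetical helper: true when the whole string consists of
    // characters matched by the given single-character pattern.
    public static bool AllMatch(string s, string charPattern) =>
        Regex.IsMatch(s, "^(?:" + charPattern + ")+$");

    static void Main()
    {
        // Persian (Extended Arabic-Indic) digits one, two, three.
        const string persian = "\u06F1\u06F2\u06F3";

        Console.WriteLine(AllMatch("123", @"\d"));        // True
        Console.WriteLine(AllMatch("123", "[0-9]"));      // True
        Console.WriteLine(AllMatch(persian, @"\d"));      // True
        Console.WriteLine(AllMatch(persian, "[0-9]"));    // False
        Console.WriteLine(AllMatch(persian, @"\p{Nd}"));  // True
    }
}
```

If the input is meant to be parsed as an ASCII number afterwards, [0-9] is probably the safer choice of the two.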

Sina Iravanian answered Sep 21 '22 08:09


Credit to ByteBlast for noticing this in the docs: with RegexOptions.ECMAScript, \d is equivalent to [0-9]. Changing only the regex constructor:

    var rex = new Regex(regex, RegexOptions.ECMAScript);

Gives new timings:

    Regex \d           took 00:00:00.1355787 result: 5077/10000
    Regex [0-9]        took 00:00:00.1360403 result: 5077/10000  100.34 % of first
    Regex [0123456789] took 00:00:00.1362112 result: 5077/10000  100.47 % of first
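The timings suggest that \d behaves like [0-9] under this option. The behavioural side of that can be checked directly with a small sketch like the one below. The names (EcmaScriptDigitDemo, IsDigitMatch) are illustrative only and were not part of the test program above.

```csharp
using System;
using System.Text.RegularExpressions;

static class EcmaScriptDigitDemo
{
    // Hypothetical helper: does \d find a match in s under the given options?
    public static bool IsDigitMatch(string s, RegexOptions options) =>
        new Regex(@"\d", options).IsMatch(s);

    static void Main()
    {
        // Persian (Extended Arabic-Indic) digit one.
        const string persianOne = "\u06F1";

        Console.WriteLine(IsDigitMatch(persianOne, RegexOptions.None));       // True
        Console.WriteLine(IsDigitMatch(persianOne, RegexOptions.ECMAScript)); // False
        Console.WriteLine(IsDigitMatch("7", RegexOptions.ECMAScript));        // True
    }
}
```

Note that RegexOptions.ECMAScript also changes other behaviour (for example, \w and \s become ASCII-only as well), so it is worth checking the rest of a pattern before switching it on purely for speed.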
weston answered Sep 22 '22 08:09