Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which is the more efficient regex?

I recently received some job applicant test results in which one person claimed the the solution they gave was more efficient (I won't say which since I don't want to influence the answers). Needless to say, I was sceptical, but I don't know enough about the inner workings of RE compilers to comment intelligently.

The question was: Give a regular expression to recognise numbers from 0 through 99 inclusive.

The answers were:

[0-9]{1,2}
[0-9]?[0-9]
[0-9]|([0-9][0-9])

I'd be interested as to why any of these are faster (or better in any other way). Bonus points for providing evidence rather than conjecture, but I'll still take conjecture if you make it sound convincing enough :-)

like image 221
paxdiablo Avatar asked Nov 24 '10 23:11

paxdiablo


2 Answers

The expression [0-9]{1,2} should be the fastest I would imagine, although it will depend on the specific engine.

My reasoning is:

  • [0-9]{1,2} - this exactly describes what you want to match.
  • [0-9]?[0-9] - this could result in a backtrack if the first match fails.
  • [0-9]|([0-9][0-9]) - this requires the first character to be checked twice if it fails, and the parentheses here are unnecessary and cause an unneeded capture.

Here are the iterations per second I got when testing this in .NET (without RegexOptions.Compiled):

Regex                      100% valid input   50% valid input  100% invalid input
"^[0-9]{1,2}$"             749086             800313           870748
"^[0-9]?[0-9]$"            731951             725984           740152
"^(?:[0-9]|([0-9][0-9]))$" 564654             687248           870378

With RegexOptions.Compiled:

Regex                      100% valid input   50% valid input  100% invalid input
"^[0-9]{1,2}$"             1486212            1592535          1831843
"^[0-9]?[0-9]$"            1301557            1448812          1559193
"^(?:[0-9]|([0-9][0-9]))$" 1131179            1303213          1394146

And as a graph:

alt text

Note: I modified each regular expression to require an exact match rather than performing a search.

like image 120
Mark Byers Avatar answered Oct 11 '22 18:10

Mark Byers


At least in theory, identical regexes like these will yield identical automata. A DFA-based matcher is going to match one character at a time and have the different possible branches encoded in its states (as opposed to taking one branch at a time and then backtracking upon failure), so the performance of each will be the same.

All three regexes would be matched by this DFA:

+---+  0-9  +---+  0-9  +---+  *  .---.
| A |  -->  | B |  -->  | C | --> (ERR)
+---+       +---+       +---+     '---'
  |          / \          |
  | *     * /   \ $       | $ 
  V        /     \        V
.---.     /       \     .---.
(ERR) <--'         '--> (ACC)
'---'                   '---'

State A: Start state. Goes to B if it sees a digit, otherwise to an ERROR state.
State B: One digit matched so far. EOL ($) is ACCEPTED. A digit moves to C. Anything else is an ERROR.
State C: Two digits matched. EOL is ACCEPTED, anything else is an ERROR.

That's my language theoretical answer. I can't speak to real world regex engine implementations. I'm ignoring the capturing semantics of the parentheses as I am guessing that's not the point of the question. Automata also don't handle other "non-theoretical" constructs like greediness, lookahead, etc. At least not in their textbook presentation.

like image 23
John Kugelman Avatar answered Oct 11 '22 18:10

John Kugelman