I recently received some job applicant test results in which one person claimed the the solution they gave was more efficient (I won't say which since I don't want to influence the answers). Needless to say, I was sceptical, but I don't know enough about the inner workings of RE compilers to comment intelligently.
The question was: Give a regular expression to recognise numbers from 0 through 99 inclusive.
The answers were:
[0-9]{1,2}
[0-9]?[0-9]
[0-9]|([0-9][0-9])
I'd be interested as to why any of these are faster (or better in any other way). Bonus points for providing evidence rather than conjecture, but I'll still take conjecture if you make it sound convincing enough :-)
The expression [0-9]{1,2}
should be the fastest I would imagine, although it will depend on the specific engine.
My reasoning is:
Here are the iterations per second I got when testing this in .NET (without RegexOptions.Compiled):
Regex 100% valid input 50% valid input 100% invalid input "^[0-9]{1,2}$" 749086 800313 870748 "^[0-9]?[0-9]$" 731951 725984 740152 "^(?:[0-9]|([0-9][0-9]))$" 564654 687248 870378
With RegexOptions.Compiled:
Regex 100% valid input 50% valid input 100% invalid input "^[0-9]{1,2}$" 1486212 1592535 1831843 "^[0-9]?[0-9]$" 1301557 1448812 1559193 "^(?:[0-9]|([0-9][0-9]))$" 1131179 1303213 1394146
And as a graph:
Note: I modified each regular expression to require an exact match rather than performing a search.
At least in theory, identical regexes like these will yield identical automata. A DFA-based matcher is going to match one character at a time and have the different possible branches encoded in its states (as opposed to taking one branch at a time and then backtracking upon failure), so the performance of each will be the same.
All three regexes would be matched by this DFA:
+---+ 0-9 +---+ 0-9 +---+ * .---.
| A | --> | B | --> | C | --> (ERR)
+---+ +---+ +---+ '---'
| / \ |
| * * / \ $ | $
V / \ V
.---. / \ .---.
(ERR) <--' '--> (ACC)
'---' '---'
State A: Start state. Goes to B if it sees a digit, otherwise to an ERROR state.
State B: One digit matched so far. EOL ($) is ACCEPTED. A digit moves to C. Anything else is an ERROR.
State C: Two digits matched. EOL is ACCEPTED, anything else is an ERROR.
That's my language theoretical answer. I can't speak to real world regex engine implementations. I'm ignoring the capturing semantics of the parentheses as I am guessing that's not the point of the question. Automata also don't handle other "non-theoretical" constructs like greediness, lookahead, etc. At least not in their textbook presentation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With