After having read up on RE/NFA and DFA, it seems that finding a substring within a string might actually be asymptotically faster using an RE rather than a brute force O(mn) find. My reasoning is that a DFA would actually maintain state and avoid processing each character in the "haystack" more than once. Hence, searches in long strings may actually be much faster if done with regular expressions.
Of course, this is valid only for RE matchers that convert from NFA to DFA.
Has anyone experienced better string match performance in real life when using RE rather than a brute force matcher?
String operations will always be faster than regular expression operations. Unless, of course, you write the string operations in an inefficient way. Regular expressions have to be parsed, and code generated to perform the operation using string operations.
Expose literal characters. Regex engines match fastest when anchors and literal characters are right there in the main pattern, rather than buried in sub-expressions. Hence the advice to "expose" literal characters whenever you can take them out of an alternation or quantified expression. Let's look at two examples.
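To make the "expose" advice concrete, here is a minimal sketch in Python. The two patterns are illustrative choices of mine, not the examples the original answer went on to show. They match exactly the same strings, but the second one pulls the shared literal prefix "th" out of the alternation. Whether this actually speeds things up depends on the engine and its optimizer, so measure before relying on it.

```python
import re

# Literal prefix buried inside the alternation.
buried = re.compile(r"(?:this|that)")

# Same language, but the common literal "th" is exposed up front,
# so an engine can scan for it before trying the alternation.
exposed = re.compile(r"th(?:is|at)")


def same_matches(text):
    """Return True if both patterns find the same matches in text."""
    return buried.findall(text) == exposed.findall(text)


if __name__ == "__main__":
    print(same_matches("this and that, then those"))
```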
A regex can be fast to execute, but compiling and setting up the regex object costs time at creation. If you build the regex object once up front and reuse it, repeated splits with the same regex will be faster than recompiling each time.
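As a rough illustration of "keep your regex object ready", here is a sketch in Python (the answer above may have had another language in mind). The names `SEPARATOR` and `split_fields` are made up for this example. Note that Python's `re` module also caches recently compiled patterns internally, so the gain from compiling by hand is smaller there than in some other runtimes.

```python
import re

# Compiled once at import time and reused for every call.
SEPARATOR = re.compile(r"\s*,\s*")


def split_fields(line):
    """Split a comma-separated line, ignoring spaces around the commas."""
    return SEPARATOR.split(line)


if __name__ == "__main__":
    for line in ("a , b,c", "x,y ,  z"):
        print(split_fields(line))
```

When the separator is a fixed string, plain `line.split(",")` avoids the regex machinery altogether.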
My experience shows that most of the time developers focus on the correctness of a regex and leave its performance aside. Yet matching a string with a regex can be surprisingly slow. So slow that it can freeze a JS app or take 100% of a server's CPU time, causing a denial of service (DoS).
First of all, I would recommend you read the article about internals of regular expressions in several languages: Regular Expression Matching Can Be Simple And Fast.
Because regexps in many languages are not just for matching, but also provide group capturing and back-references, almost all implementations use so-called "backtracking" when executing the NFA built from the given regexp. This approach has exponential time complexity in the worst case.
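To get a feel for that worst case, here is a toy model in Python. It is not a real regex engine: the hypothetical function `count_attempts` just counts the states a naive backtracking matcher would visit when trying the pattern `(a|aa)+$` against a string of `a`s followed by a `b`, which can never match. Every way of carving the `a`s into pieces of length one or two gets explored, so the count grows like the Fibonacci numbers.

```python
def count_attempts(text, pos=0):
    """Count states visited by a naive backtracking search for (a|aa)+$.

    Toy model only: each call is one state; from each state the matcher
    tries the alternative "a" and then the alternative "aa".
    """
    attempts = 1  # this state
    for piece in ("a", "aa"):
        if text.startswith(piece, pos):
            attempts += count_attempts(text, pos + len(piece))
    return attempts


if __name__ == "__main__":
    for n in (5, 10, 15, 20):
        print(n, count_attempts("a" * n + "b"))
```

A DFA-based matcher would read each of the n + 1 characters once and reject.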
An RE implementation could be built on a DFA (with group capturing), but that has an overhead (see Laurikari's paper NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions).
For simple substring searching you could use the Knuth-Morris-Pratt algorithm, which effectively builds a DFA for the pattern and runs in linear time, O(n + m) for a haystack of length n and a pattern of length m. But it has overhead too, and if you test the naive approach (O(nm)) against this optimal algorithm on real-world words and phrases (which are not very repetitive), you may find that the naive approach is better on average.
For exact substring searching you could also try the Boyer–Moore algorithm, which has O(mn) worst-case complexity in its basic form, but works better than KMP on average on real-world data.
Most regular expressions used in practice are PCRE (Perl-Compatible Regular Expressions), which are more expressive than regular languages and thus cannot be described by a regular grammar. PCRE has things like positive/negative lookahead/lookbehind assertions and even recursion, so matching may require processing some characters more than once. In the end it all comes down to the particular RE implementation: whether or not it optimizes for expressions that stay within the bounds of a regular grammar.
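For reference, a minimal sketch of Knuth-Morris-Pratt in Python. The function names are mine. The failure table plays the role of the DFA: on a mismatch it says how much of the pattern is still matched, so the haystack index never moves backwards.

```python
def build_failure_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(haystack, needle):
    """Return the index of the first occurrence of needle, or -1."""
    if not needle:
        return 0
    table = build_failure_table(needle)
    k = 0  # number of needle characters currently matched
    for i, ch in enumerate(haystack):
        while k > 0 and ch != needle[k]:
            k = table[k - 1]  # fall back without re-reading the haystack
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            return i - k + 1
    return -1


if __name__ == "__main__":
    print(kmp_find("abracadabra", "cad"))
```

The table construction is the up-front overhead mentioned above; for short, non-repetitive needles it often buys little over the naive loop.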
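The sketch below is not full Boyer-Moore but the simpler Boyer-Moore-Horspool variant, which keeps only the bad-character shift; it is written in Python with a function name of my own choosing. It shows why this family tends to do well on ordinary text: when the haystack character aligned with the end of the needle does not occur in the needle, the search skips a whole needle length at once.

```python
def horspool_find(haystack, needle):
    """Return the index of the first occurrence of needle, or -1.

    Boyer-Moore-Horspool: bad-character shifts only, no good-suffix rule.
    """
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Distance from the last occurrence of each character (excluding the
    # final needle position) to the end of the needle.
    shift = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        # A slice comparison keeps the sketch short; the classic
        # formulation compares right to left, character by character.
        if haystack[pos:pos + m] == needle:
            return pos
        # Shift based on the haystack character under the needle's end.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1


if __name__ == "__main__":
    text = "here is a simple example"
    print(horspool_find(text, "example"), text.find("example"))
```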
Personally, I haven't done any sort of performance comparisons between the two. However, in my experience I never ever had performance issues with brute force find-and-replace, while I had to deal with RE performance bottlenecks on more than one occasion.
If you look at the documentation for most languages, it will mention that if you don't need the power of regex you should use the non-regex version for performance reasons. Example: http://www.php.net/manual/en/function.preg-split.php states: "If you don't need the power of regular expressions, you can choose faster (albeit simpler) alternatives like explode() or str_split()."
This is a trade-off that exists everywhere. That is, the more flexible and feature-rich a solution is, the poorer its performance tends to be.