substring match faster with regular expression?

After reading up on regular expressions, NFAs and DFAs, it seems that finding a substring within a string might actually be asymptotically faster using an RE than using a brute-force O(mn) find (with m the length of the pattern and n the length of the haystack). My reasoning is that a DFA maintains state and avoids processing each character in the "haystack" more than once. Hence, searches in long strings may actually be much faster if done with regular expressions.

Of course, this is valid only for RE matchers that convert from NFA to DFA.
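To make the O(mn) worst case concrete, here is a minimal sketch of a brute-force matcher that also counts character comparisons. The function name `naive_find` and the counting are illustrative assumptions, not taken from any particular library.

```python
def naive_find(haystack, needle):
    """Brute-force substring search.

    Returns (index, comparisons): the index of the first match (or -1)
    and the number of character comparisons performed.
    """
    comparisons = 0
    n, m = len(haystack), len(needle)
    for start in range(n - m + 1):
        j = 0
        # Compare the needle against the haystack at this alignment.
        while j < m:
            comparisons += 1
            if haystack[start + j] != needle[j]:
                break
            j += 1
        if j == m:
            return start, comparisons
    return -1, comparisons
```

On a repetitive input such as `naive_find("a" * 20, "aaab")` every one of the 17 alignments costs 4 comparisons, so the call returns `(-1, 68)`. On typical text most alignments fail on the first character, which is why the brute-force approach is usually far from its worst case.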

Has anyone experienced better string match performance in real life when using RE rather than a brute force matcher?

asked Jul 21 '10 20:07

dhruvbird

3 Answers

First of all, I recommend reading Russ Cox's article on the internals of regular expression engines in several languages: Regular Expression Matching Can Be Simple And Fast.

Because regexps in many languages are used not just for matching but also for group capturing and back-referencing, almost all implementations use so-called "backtracking" when executing the NFA built from the given regexp. This approach has exponential time complexity in the worst case.
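The exponential blow-up can be reproduced with a toy matcher. The sketch below is not how any real engine is written; it is a hypothetical, deliberately tiny backtracking matcher that supports only literal characters and optional characters, run on the classic pathological case of n optional 'a's followed by n mandatory 'a's (a?ⁿaⁿ) against the text aⁿ.

```python
def count_backtracking_steps(n):
    """Match the pattern a?^n a^n against the text a^n with naive backtracking.

    Returns (matched, steps), where steps is the number of recursive calls.
    """
    # Each pattern item is (character, is_optional).
    pattern = [("a", True)] * n + [("a", False)] * n
    text = "a" * n
    steps = 0

    def match(pi, ti):
        nonlocal steps
        steps += 1
        if pi == len(pattern):
            return ti == len(text)
        ch, optional = pattern[pi]
        # Greedy choice first: consume a character if possible.
        if ti < len(text) and text[ti] == ch and match(pi + 1, ti + 1):
            return True
        # Backtrack: skip the item if it is optional.
        if optional and match(pi + 1, ti):
            return True
        return False

    return match(0, 0), steps
```

The only successful path skips every optional 'a', and the greedy matcher tries that path last, so it explores all 2ⁿ take/skip combinations first. `count_backtracking_steps(n)` therefore reports a match after at least `2**n` steps, whereas a DFA or a simulated NFA would scan the n input characters once.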

An RE implementation based on a DFA that still supports group capturing is possible, but it carries overhead (see Laurikari's paper NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions).

For simple substring searching you could use the Knuth-Morris-Pratt algorithm, which in effect builds a DFA for the substring and runs in optimal O(n + m) time, i.e. linear in the length of the haystack once the pattern has been preprocessed. But it has overhead too, and if you test the naive O(nm) approach against this optimal algorithm on real-world words and phrases (which are not very repetitive), you may find that the naive approach is faster on average.
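As a rough illustration, a common textbook formulation of Knuth-Morris-Pratt looks like the sketch below. It uses a failure table rather than an explicit DFA transition table; the names `build_failure_table` and `kmp_find` are chosen here for illustration.

```python
def build_failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find(haystack, needle):
    """Return the index of the first occurrence of needle, or -1."""
    if not needle:
        return 0
    fail = build_failure_table(needle)
    k = 0  # number of needle characters currently matched
    for i, ch in enumerate(haystack):
        # On a mismatch, fall back in the needle; never move back in the haystack.
        while k > 0 and ch != needle[k]:
            k = fail[k - 1]
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            return i - len(needle) + 1
    return -1
```

The haystack index `i` only ever moves forward, which is the "each character is processed once" property the question asks about. The cost is the extra table and the less predictable inner loop, which is part of the overhead mentioned above.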

For exact substring searching you could also try the Boyer–Moore algorithm, which has O(mn) worst-case complexity but tends to perform better than KMP on average on real-world data.
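The idea of skipping ahead can be sketched with Horspool's simplification of Boyer–Moore. Note that this is not the full algorithm (it uses only a bad-character shift table and no good-suffix rule), and the name `horspool_find` is an assumption made for this example.

```python
def horspool_find(haystack, needle):
    """Boyer-Moore-Horspool search: return the first match index, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # For each character of the needle except the last one, record how far
    # its rightmost occurrence is from the end of the needle.
    shift = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        # Shift by the table entry for the haystack character aligned with
        # the needle's last position; unseen characters allow a full jump of m.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1
```

When the aligned haystack character does not occur in the needle at all, the window jumps by the full needle length, so on ordinary text many haystack characters are never inspected. On adversarial inputs the shifts shrink to 1 and the worst case is still O(mn).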

answered Oct 05 '22 10:10

Dmitry

Most regular expression engines used in practice are PCRE-style (Perl-Compatible Regular Expressions), which describe more than the regular languages and thus cannot always be expressed with a regular grammar. PCRE has features like positive/negative lookahead/lookbehind assertions and even recursion, so matching may require processing some characters more than once. Of course, it all comes down to the particular RE implementation: whether or not it is optimized for expressions that stay within the bounds of a regular grammar.
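A small example of a pattern that goes beyond regular languages is one with a backreference. The sketch below uses Python's `re` module rather than PCRE itself, and the helper name `is_mirrored` is made up for illustration. It recognizes strings of the form aⁿbaⁿ, a language that no DFA can accept because it requires remembering an unbounded count.

```python
import re

# \1 must repeat exactly what the first group captured, so both runs of
# 'a' around the 'b' have to be the same length.
MIRRORED = re.compile(r"(a*)b\1")


def is_mirrored(text):
    """Return True if text has the form a^n b a^n for some n >= 0."""
    return MIRRORED.fullmatch(text) is not None
```

`is_mirrored("aabaa")` is true while `is_mirrored("aabaaa")` is false. An engine that supports this kind of pattern cannot rely on a plain DFA for it, which is one reason backtracking implementations are so common.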

Personally, I haven't done any sort of performance comparison between the two. However, in my experience I have never had performance issues with brute-force find-and-replace, while I have had to deal with RE performance bottlenecks on more than one occasion.
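For anyone who wants to run such a comparison, a minimal timing harness in Python might look like the following. The test data, the repeat count and the function names are arbitrary assumptions, and the numbers will depend heavily on the interpreter, the regex engine and the input, so no expected timings are given.

```python
import re
import timeit

# Hypothetical test data: a long haystack with the needle at the very end.
HAYSTACK = "lorem ipsum " * 10_000 + "needle"
NEEDLE = "needle"
COMPILED = re.compile(re.escape(NEEDLE))  # escape so the needle is a literal


def plain_find():
    """Built-in substring search."""
    return HAYSTACK.find(NEEDLE)


def regex_find():
    """The same search through the regex engine."""
    match = COMPILED.search(HAYSTACK)
    return match.start() if match else -1


if __name__ == "__main__":
    print("str.find :", timeit.timeit(plain_find, number=200))
    print("re.search:", timeit.timeit(regex_find, number=200))
```

Both functions return the same index, so the timings compare like with like. To probe the worst case rather than the typical case, swap in a repetitive haystack and needle such as `"a" * 100_000` and `"a" * 50 + "b"`.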

answered Oct 05 '22 09:10

buru

If you look at the documentation for most languages, it will mention that if you don't need the power of regex you should use the non-regex version for performance reasons. For example, http://www.php.net/manual/en/function.preg-split.php states: "If you don't need the power of regular expressions, you can choose faster (albeit simpler) alternatives like explode() or str_split()."

This trade-off exists everywhere: the more flexible and feature-rich a solution is, the poorer its performance tends to be.

answered Oct 05 '22 08:10

Parris


Tags:

string

regex

regular-language