What's a pathological regex that blows up many parsers (both in time and memory)? And which parsers? Bonus points the more basic and standard the regex is, and the more likely that a non-malicious user might innocently come up with it. Feel free to post actual time and memory data, and the parser version.
(I seem to remember that excessive lookbehind assertions or (EDIT:) backtracking in Perl are said to do this, or at least used to. Anything else?)
Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters you scan to determine the match, the faster your regexes will be.
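As an illustration of that advice (my example, not the answerer's): matching a double-quoted attribute value with a negated character class instead of a greedy dot both avoids backtracking past the closing quote and fixes a correctness bug:

```python
import re

line = '<a href="one.html">link</a> and <a href="two.html">another</a>'

# Greedy dot: ".*" scans to the end of the line, then backtracks to the
# LAST quote, so it wrongly spans both attribute values.
greedy = re.search(r'".*"', line).group()

# Negated class: "[^"]*" can never cross a quote, so there is no
# backtracking and it stops at the first closing quote.
specific = re.search(r'"[^"]*"', line).group()

print(greedy)    # "one.html">link</a> and <a href="two.html"
print(specific)  # "one.html"
```

The negated class is longer to type, but every character it consumes is decided once and never revisited, which is exactly the "fewer characters scanned" effect described above.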
Regular expressions are dense. This makes them hard to read, but not in proportion to the information they carry. Certainly 100 characters of regular expression syntax is harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code.
Catastrophic backtracking is a condition that can occur if you're checking a (usually long) string against a complex regular expression. The problem usually occurs when something towards the end of the string prevents a match, forcing the engine to backtrack through an enormous number of alternative ways of matching the earlier part.
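A minimal sketch of this in Python, using the textbook pathological pattern `(x+x+)+y` (my choice of example): the nested quantifiers can split a run of `x`s in exponentially many ways, and because the input lacks the trailing `y`, the engine tries all of them before reporting failure:

```python
import re
import time

# Nested quantifiers over the same character, followed by a literal
# the input does not contain: the classic catastrophic pattern.
pattern = re.compile(r'(x+x+)+y')

for n in (10, 15, 20, 22):
    subject = 'x' * n          # no trailing 'y', so the match must fail
    start = time.perf_counter()
    assert pattern.search(subject) is None
    print(f'n={n:2d}: {time.perf_counter() - start:.4f}s')
```

Each additional `x` roughly doubles the running time; pushing `n` much past 25 makes CPython's `re` hang for minutes.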
Adapted from the first example in the article Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...):
perl -e '$n=29; ("a" x $n) =~ (("a?" x $n).("a" x $n))'
which takes 40+ seconds on my system. Then do `$n++` for exponentially increasing fun...
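The same blow-up is easy to reproduce in Python's `re` module, one of the backtracking engines the article's title names (my translation of the Perl one-liner, with smaller `n` so it finishes quickly):

```python
import re
import time

# "a?" * n followed by "a" * n, matched against "a" * n: every optional
# "a?" must end up matching empty, but a backtracking engine discovers
# this only after exploring ~2**n combinations.
for n in (5, 10, 15, 18):
    pattern = re.compile('a?' * n + 'a' * n)
    subject = 'a' * n
    start = time.perf_counter()
    assert pattern.match(subject)   # succeeds, but only after exponential work
    print(f'n={n:2d}: {time.perf_counter() - start:.4f}s')
```

Note that here the match ultimately *succeeds*; the exponential cost comes from the search, not from failure, which is what makes the pattern so easy to stumble into innocently.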
From Russ Cox's excellent article: `$ perl -e '("a" x 100000) =~ /^(ab?)*$/;'`. This apparently causes a segfault. There are more examples in the article.