I recently had a SonarQube rule (https://rules.sonarsource.com/java/RSPEC-4784) bring to my attention some performance issues which could be used as a denial of service against a Java regular expression implementation.
Indeed, the following Java test shows how slow the wrong regular expression can be:
import org.junit.Test;
public class RegexTest {
@Test
public void fastRegex1() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)b");
}
@Test
public void fastRegex2() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaab".matches("(a+)+b");
}
@Test
public void slowRegex() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b");
}
}
As you can see, the first two tests are fast, the third one is incredibly slow (in Java 8)
The same data and regex in Perl or Python, however, is not at all slow, which leads me to wonder why it is that this regular expression is so slow to evaluate in Java.
$ time perl -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /(a+)+b/ && print "$1\n"'
aaaaaaaaaaaaaaaaaaaaaaaaaaaa
real 0m0.004s
user 0m0.000s
sys 0m0.004s
$ time python3 -c 'import re; m=re.search("(a+)+b","aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"); print(m.group(0))'
aaaaaaaaaaaaaaaaaaaaaaaaaaaab
real 0m0.018s
user 0m0.015s
sys 0m0.004s
What is it about the extra matching modifier +
or trailing character s
in the data which makes this regular expression so slow, and why is it only specific to Java?
The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
it takes about 40 micro second. No need to say when the number of string values exceeds a few thousands, it'll be too slow.
The bad regular expression took on average 10,100 milliseconds to process all 1,000,000 lines, while the good regular expression took just 240 milliseconds.
Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement.
Caveat: I don't really know much about regex internals, and this is really conjecture. And I can't answer why Java suffers from this, but not the others (also, it is substantially faster than your 12 seconds in jshell 11 when I run it, so it perhaps only affects certain versions).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b")
There are lots of ways that lots of a
s could match:
(a)(a)(a)(a)
(aa)(a)(a)
(a)(aa)(a)
(aa)(aa)
(a)(aaa)
etc.
For the input string "aaaaaaaaaaaaaaaaaaaaaaaaaaaab"
, it will greedily match all of those a
s in a single pass, match the b
, job done.
For "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"
, when it gets to the end and finds that the string doesn't match (because of the s
), it's not correctly recognizing that the s
means it can never match. So, having gone through and likely matched as
(aaaaaaaaaaaaaaaaaaaaaaaaaaaa)bs
it thinks "Oh, maybe it failed because of the way I grouped the a
s - and goes back and tries all the other combinations of the a
s.
(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a)bs // Nope, still no match
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa)(aaa)bs // ...
...
(a)(aaaaaaaaaaaaaaaaaaaaaaaaaaa)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaaaa(a)(a)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa(aa)(a)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaa(aaa)(a)bs // ...
...
There are lots of these (I think there are something like 2^27 - that's 134,217,728 - combinations for 28 a
s, because each a
can either be part of the previous group, or start its own group), so it takes a long time.
I don't know Perl too well but the Python version is not equivalent to the Java one. You are using search()
but the Java version is using matches()
. The equivalent method in Python would be fullmatch()
When I run your examples in Python (3.8.2) with search()
I get quick results as you do. When I run it with fullmatch()
I get poor (multi-second) execution time. Could it be that your Perl example is also not doing a full match?
BTW: if you want to try the Java version of search you would use:
Pattern.compile("(a+)+b").matcher("aaaaaaaaaaaaaaaaaaaaaaaaaaaabs").find();
There might be some slight difference in the semantics but it should be close enough for this purpose.
The extra +
causes a lot of backtracking (in a naive regexp implementation) when the string cannot be matched. If the string can be matched, the answer is known in the first try. This explains why the case 2 is fast and only case 3 is slow.
The site https://swtch.com/~rsc/regexp/regexp1.html has some detailed information on regular expression implementation techniques and the theory behind them. I know link only answers are bad, but this is worth reading, showing an example regular expression that completes in 30 micro seconds with the better implementation, and 60 seconds (2 million times slower) with the better known and more obvious way.
It says
"Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools."
Other answers saying that the extra +
causes too much backtracking are correct, but only if you ignore the good theory.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With