Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is this regular expression so slow in Java? [duplicate]

I recently had a SonarQube rule (https://rules.sonarsource.com/java/RSPEC-4784) bring to my attention some performance issues which could be used as a denial of service against a Java regular expression implementation.

Indeed, the following Java test shows how slow the wrong regular expression can be:

    import org.junit.Test;

    public class RegexTest {

    @Test
    public void fastRegex1() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)b");
    }

    @Test
    public void fastRegex2() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaab".matches("(a+)+b");
    }

    @Test
    public void slowRegex() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b");
    }
}

As you can see, the first two tests are fast, the third one is incredibly slow (in Java 8)

Enter image description here

The same data and regex in Perl or Python, however, is not at all slow, which leads me to wonder why it is that this regular expression is so slow to evaluate in Java.

$ time perl -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /(a+)+b/ && print "$1\n"'
aaaaaaaaaaaaaaaaaaaaaaaaaaaa

real    0m0.004s
user    0m0.000s
sys     0m0.004s

$ time python3 -c 'import re; m=re.search("(a+)+b","aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"); print(m.group(0))'
aaaaaaaaaaaaaaaaaaaaaaaaaaaab

real    0m0.018s
user    0m0.015s
sys     0m0.004s

What is it about the extra matching modifier + or trailing character s in the data which makes this regular expression so slow, and why is it only specific to Java?

like image 829
AntonPiatek Avatar asked Aug 07 '20 08:08

AntonPiatek


People also ask

Why is my regex so slow?

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

Is regex slow in Java?

it takes about 40 micro second. No need to say when the number of string values exceeds a few thousands, it'll be too slow.

How fast is regex matching?

The bad regular expression took on average 10,100 milliseconds to process all 1,000,000 lines, while the good regular expression took just 240 milliseconds.

Is regex faster than for loop Java?

Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement.


4 Answers

Caveat: I don't really know much about regex internals, and this is really conjecture. And I can't answer why Java suffers from this, but not the others (also, it is substantially faster than your 12 seconds in jshell 11 when I run it, so it perhaps only affects certain versions).

"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b")

There are lots of ways that lots of as could match:

(a)(a)(a)(a)
(aa)(a)(a)
(a)(aa)(a)
(aa)(aa)
(a)(aaa)
etc.

For the input string "aaaaaaaaaaaaaaaaaaaaaaaaaaaab", it will greedily match all of those as in a single pass, match the b, job done.

For "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs", when it gets to the end and finds that the string doesn't match (because of the s), it's not correctly recognizing that the s means it can never match. So, having gone through and likely matched as

(aaaaaaaaaaaaaaaaaaaaaaaaaaaa)bs

it thinks "Oh, maybe it failed because of the way I grouped the as - and goes back and tries all the other combinations of the as.

(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a)bs  // Nope, still no match
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa)(aaa)bs  // ...
...
(a)(aaaaaaaaaaaaaaaaaaaaaaaaaaa)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaaa(a)(a)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa(aa)(a)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaa(aaa)(a)bs  // ...
...

There are lots of these (I think there are something like 2^27 - that's 134,217,728 - combinations for 28 as, because each a can either be part of the previous group, or start its own group), so it takes a long time.

like image 75
Andy Turner Avatar answered Oct 17 '22 12:10

Andy Turner


I don't know Perl too well but the Python version is not equivalent to the Java one. You are using search() but the Java version is using matches(). The equivalent method in Python would be fullmatch()

When I run your examples in Python (3.8.2) with search() I get quick results as you do. When I run it with fullmatch() I get poor (multi-second) execution time. Could it be that your Perl example is also not doing a full match?

BTW: if you want to try the Java version of search you would use:

Pattern.compile("(a+)+b").matcher("aaaaaaaaaaaaaaaaaaaaaaaaaaaabs").find();

There might be some slight difference in the semantics but it should be close enough for this purpose.

like image 38
JimmyJames Avatar answered Oct 17 '22 14:10

JimmyJames


The extra + causes a lot of backtracking (in a naive regexp implementation) when the string cannot be matched. If the string can be matched, the answer is known in the first try. This explains why the case 2 is fast and only case 3 is slow.

like image 14
Henry Avatar answered Oct 17 '22 13:10

Henry


The site https://swtch.com/~rsc/regexp/regexp1.html has some detailed information on regular expression implementation techniques and the theory behind them. I know link only answers are bad, but this is worth reading, showing an example regular expression that completes in 30 micro seconds with the better implementation, and 60 seconds (2 million times slower) with the better known and more obvious way.

It says

"Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools."

Other answers saying that the extra + causes too much backtracking are correct, but only if you ignore the good theory.

like image 8
icarus Avatar answered Oct 17 '22 12:10

icarus