Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex execution is too slow in java [duplicate]

Tags:

java

regex

My purpose is to match this kind of different urls:
url.com
my.url.com
my.extended.url.com
a.super.extended.url.com
and so on...

So, I decided to build the regex to have a letter or a number at start and end of the url, and to have a infinite number of "subdomains" with alphanumeric characters and a dot. For example, in "my.extended.url.com", "m" from "my" is the first class of the regex, "m" from "com" is the last class of the regex, and "y.", "extended." and "url." are the second class of the regex.

Using the pattern and subject in the code below, I want the find method to return me a false because this url must not match, but it uses 100% of CPU and seems to stay in an infinite loop.

    
    String subject = "www.association-belgo-palestinienne-be";
    Pattern pattern = Pattern.compile("^[A-Za-z0-9]\\.?([A-Za-z0-9_-]+\\.?)*[A-Za-z0-9]\\.[A-Za-z]{2,6}");

    Matcher m = pattern.matcher(subject);
    System.out.println("    Start");
    boolean hasFind = m.find();
    System.out.println("    Finish : " + hasFind);
  

Which only prints:

  
      Start
  

I can't reproduce the problem using regex testers.
Is it normal ? Is the problem coming from my regex ?
Could it be due to my Java version (1.6.0_22-b04 / JVM 64 bit 17.1-b03) ?

Thanks in advance for helping.

like image 703
carpediem Avatar asked Dec 21 '10 14:12

carpediem


2 Answers

It's not an infinite loop. The problem is that it's checking every possible match and not finding one. If you could let it run for a gazillion years, it will eventually terminate. See this article for a good explanation of what's happening under the hood.

Perhaps this regular expression is satisfactory (it terminates on the given string): ^[A-Za-z0-9][A-Za-z0-9_-]*(\\.[A-Za-z0-9_-]+)*\\.[A-Za-z]{2,6}$ (see http://ideone.com/Z0rlg)

like image 54
moinudin Avatar answered Oct 02 '22 08:10

moinudin


The problem is the ([A-Za-z0-9_-]+\\.?)* part of the regular expression. Note that it has a quantifier (+) inside another quantifier (*). This causes catastrophic backtracking - basically, it has to try an exponential number of matches in order to check the regular expression, at least the way most regular expression engines are implemented (including the Java one).

If you use possessive quantifiers, you will be able to avoid this problem, however that would change the meaning of your regex, and it would no longer match what you want it to match.

I think the trick here is to find a regex which expresses what you want to solve, without double quantifiers. For example, the following should work:

Pattern.compile("^[A-Za-z0-9]\\.?([A-Za-z0-9_-]|[A-Za-z0-9_-]\\.)*[A-Za-z0-9]\\.[A-Za-z]{2,6}$");

I think this expresses the same class of strings that you are trying to match, and should be much faster.

like image 30
Avi Avatar answered Oct 02 '22 09:10

Avi