Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java regular expression optimization tips

Tags:

java

regex

I am new to Java Regular expression. We are using a pattern for matching a string. We are using this for validating a text field and it meets our requirements. But there is a performance issue in the matching.

Pattern : ([a-zA-Z0-9]+[ ]?(([_\-][a-zA-Z0-9 ])*)?[_\-]?)+

  1. Input text should start with a-zA-Z0-9.
  2. Space(single) is allowed between words
  3. "_" and "-" are allowed but cannot be consecutive.

Our problem is, for certain input strings the CPU time goes high and causes hanging the threads. Also we get exceptions. Can anyone please help me to optimize the Pattern or suggest a new pattern to solve my issue.

Exception details                              
============================================                           
Hung thread details, all the same:
[9/28/11 11:40:07:320 CDT] 00000003 ThreadMonitor W   WSVR0605W: Thread "WebContainer : 26" (0000004f) has been active for 709755 mi
lliseconds and may be hung.  There is/are 1 thread(s) in total in the server that may be hung.
        at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3938)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3801)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
        at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
        at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
        at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3801)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
        at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
        at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
        at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
        at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
        at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4323)
        at java.util.regex.Pattern$Prolog.match(Pattern.java:4263)
        at java.util.regex.Matcher.match(Matcher.java:1139)
        at java.util.regex.Matcher.matches(Matcher.java:514)
like image 737
dpkcv Avatar asked Sep 30 '11 07:09

dpkcv


1 Answers

Your problem is catastrophic backtracking which you're getting because you have nested quantifiers. This starts becoming a problem when the text doesn't match the requirements because of the exponentially increasing number of permutations the regex engine has to go through to declare failure.

([a-zA-Z0-9]+[ ]?(([_\-][a-zA-Z0-9 ])*)?[_\-]?)+
                                     ^         ^
                                     |         repetition
                                     repetition

Reconstruct your regex like this:

(?i)^(?!.*(?:--|__))[A-Z0-9][\w-]*(?: [\w-]+)*$

Java, with explanation:

boolean foundMatch = subjectString.matches(
    "(?ix)      # Case-insensitive, multiline regex:\n" +
    "^          # Start of string\n" +
    "(?!        # Assert that it's impossible to match\n" +
    " .*        # any number of characters\n" +
    " (?:--|__) # followed by -- or __\n" +
    ")          # End of lookahead assertion\n" +
    "[A-Z0-9]   # Match A-Z, a-z or 0-9\n" +
    "[\\w-]*    # Match 0 or more alnums/dash\n" +
    "(?:        # Match the following:\n" +
    " [\\ ]     # a single space\n" +
    " [\\w-]+   # followed by one or more alnums or -\n" +
    ")*         # any number of times\n" +
    "$          # End of string");

Note that the string must not end in a space as per your requirements. In case you're wondering, \w is a shorthand for [A-Za-z0-9_].

like image 106
Tim Pietzcker Avatar answered Oct 20 '22 20:10

Tim Pietzcker