Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

My regex is causing a stack overflow in Java; what am I missing?

I am attempting to use a regular expression with Scanner to match a string from a file. The regex works with all of the contents of the file except for this line:

DNA="ITTTAITATIATYAAAYIYI[....]ITYTYITTIYAIAIYIT"

in the actual file, the ellipsis represents several thousand more characters.

When the loop that reads the file arrives on the line containing the bases, a stack overflow error occurs.

Here is the loop:

while (scanFile.hasNextLine()) {
   final String currentLine = scanFile.findInLine(".*");
   System.out.println("trying to match '" + currentLine + "'");
   Scanner internalScanner = new Scanner(currentLine);
   String matchResult = internalScanner.findInLine(Constants.ANIMAL_INFO_REGEX);
   assert matchResult != null : "there's no reason not to find a match"; 
   matches.put(internalScanner.match().group(1), internalScanner.match().group(2));
   scanFile.nextLine();
  }

and the regex:

static final String ANIMAL_INFO_REGEX = "([a-zA-Z]+) *= *\"(([a-zA-Z_.]| |\\.)+)";

Here's the failure trace:

java.lang.StackOverflowError
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3360)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
    ...etc (it's all regex).

Thanks so much!

like image 671
Trevor Avatar asked Sep 10 '10 02:09

Trevor


People also ask

How to fix stack Overflow error in java?

Increase Thread Stack Size (-Xss) Increasing the stack size can be useful, for example, when the program involves calling a large number of methods or using lots of local variables. This will set the thread's stack size to 4 mb which should prevent the JVM from throwing a java. lang. StackOverflowError .

What causes stack Overflow error in java?

The main cause of the StackOverflowError is that we haven't provided the proper terminating condition to our recursive function or template, which means it will turn into an infinite loop.

What is regular expression stack overflow?

Questions tagged [regex] Ask Question. Regular expressions provide a declarative language to match patterns within strings. They are commonly used for string validation, parsing, and transformation.


4 Answers

This looks like bug 5050507 . I agree with Asaph that removing the alternation should help; the bug specifically says "Avoid alternation whenever possible". I think you can go probably even simpler:

"^([a-zA-Z]+) *= *\"([^\"]+)"
like image 153
Matthew Flaschen Avatar answered Nov 06 '22 19:11

Matthew Flaschen


Try this simplified version of your regex that removes some unnecessary | operators (which might have been causing the regex engine to do a lot of branching) and includes beginning and end of line anchors.

static final String ANIMAL_INFO_REGEX = "^([a-zA-Z]+) *= *\"([a-zA-Z_. ]+)\"$";
like image 33
Asaph Avatar answered Nov 06 '22 19:11

Asaph


read this to understand the problem: http://www.regular-expressions.info/catastrophic.html ... and then use one of the other suggestions

like image 2
Zac Thompson Avatar answered Nov 06 '22 20:11

Zac Thompson


As the others have said, your regex is much less efficient than it should be. I'd take it a step further and use possessive quantifiers:

"^([a-zA-Z]++) *+= *+\"([^\"]++)\"$"

But the way you're using the Scanner doesn't make much sense, either. There's no need to use findInLine(".*") to read the line; that's what nextLine() does. And you don't need to create another Scanner to apply your regex; just use a Matcher.

static final Pattern ANIMAL_INFO_PATTERN = 
    Pattern.compile("^([a-zA-Z]++) *+= *+\"([^\"]++)\"$");

...

  Matcher lineMatcher = ANIMAL_INFO_PATTERN.matcher("");
  while (scanFile.hasNextLine()) {
    String currentLine = scanFile.nextLine();
    if (lineMatcher.reset(currentLine).matches()) {
      matches.put(lineMatcher.group(1), lineMatcher.group(2));
    }
  }
like image 1
Alan Moore Avatar answered Nov 06 '22 20:11

Alan Moore