Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pattern.matches() gives StackOverflowError

I'm using java's Pattern.matches to match a block of data to a regex. The block of data can be a single line or multiple lines. The problem is that once my data becomes more than 15 lines (typically more than 17-18 lines), i start getting stackoverflowerror. For data less than 15 lines the regex works fine.

The Regex is of this format:
domainname -> space -> , -> space -> number -> space -> , -> space -> number -> newline

String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";

The data block i use to test against this regex is this

abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456

This is the code:

String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
boolean valid = Pattern.matches(regex, data); //fails here
like image 341
Purav Shah Avatar asked Oct 05 '11 14:10

Purav Shah


People also ask

What causes StackOverflowError?

StackOverflowError is a runtime error which points to serious problems that cannot be caught by an application. The java. lang. StackOverflowError indicates that the application stack is exhausted and is usually caused by deep or infinite recursion.

What would cause a StackOverflowError to occur in recursion?

The common cause for a stack overflow is a bad recursive call. Typically, this is caused when your recursive functions doesn't have the correct termination condition, so it ends up calling itself forever.


2 Answers

I can't tell you the reason for this error; the regex itself is fine and not subject to catastrophic backtracking or any other obvious error.

Perhaps you can reduce the number of backtracking positions the regex engine saves by using possessive quantifiers (++ instead of +, *+ instead of *, {2,}+ instead of {2,} etc.). Also, you don't need the capturing groups (thanks Thomas), so I've changed them into non-capturing ones:

"(?:(?:[a-zA-Z0-9][a-zA-Z0-9-]*+\\.)++([a-zA-Z]{2,}+)\\s*+,\\s*+\\d++\\s*+,\\s*+\\d++(\r?+\n)?+)++"

This won't change the behaviour of the regex (except for the removal of the unnecessary anchors since you're using Pattern.matches()), but perhaps it helps avoid StackOverflows. I don't have a Java SDK installed, so I can't test it myself, though.

like image 155
Tim Pietzcker Avatar answered Sep 21 '22 14:09

Tim Pietzcker


You might try and use atomic groups ((?>expression)) to prevent backtracking:

Here's a test that failed with a block of 1000 lines using your regex but succeeds now (takes a while, thus I only tested with 5000 20000 :) ):

String regex = "(?>(?>[a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+(?>[a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(?>\\r?\\n)?)+";

StringBuilder input = new StringBuilder();

for( int i = 0; i < 1000000; ++i) {
  input.append("abc.com, 123, 456\n");
}

Pattern p = Pattern.compile( regex );
Matcher m = p.matcher( input );

System.out.println(m.matches());

So after all, it might still be a backtracking problem.

Update: just let that test run with 20000 lines and still didn't fail. That's at least 20 times as much as before. :)

Update 2: looking at my test again I found the slow part, the string concatenation. (o..O). I've updated the test and used 1 Million lines, still no fail. :)

like image 37
Thomas Avatar answered Sep 23 '22 14:09

Thomas