Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex look-behind without obvious maximum length in Java

Tags:

I always thought that a look-behind assertion in Java's regex-API (and many other languages for that matter) must have an obvious length. So, STAR and PLUS quantifiers are not allowed inside look-behinds.

The excellent online resource regular-expressions.info seems to confirm (some of) my assumptions:

"[...] Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java recognizes the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6. [...]"

-- http://www.regular-expressions.info/lookaround.html

Using the curly brackets works as long as the total length of range of the characters inside the look-behind is smaller or equal to Integer.MAX_VALUE. So these regexes are valid:

"(?<=a{0,"   +(Integer.MAX_VALUE)   + "})B" "(?<=Ca{0,"  +(Integer.MAX_VALUE-1) + "})B" "(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B" 

But these aren't:

"(?<=Ca{0,"  +(Integer.MAX_VALUE)   +"})B" "(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B" 

However, I don't understand the following:

When I run a test using the * and + quantifier inside a look-behind, all goes well (see output Test 1 and Test 2).

But, when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see output Test 3).

Making the greedy * from Test 3 reluctant has no effect, it still breaks (see Test 4).

Here's the test harness:

public class Main {      private static String testFind(String regex, String input) {         try {             boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();             return "testFind       : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;         } catch(Exception e) {             return "testFind       : Invalid -> "+regex+", "+e.getMessage();         }     }      private static String testReplaceAll(String regex, String input) {         try {             String returned = input.replaceAll(regex, "FOO");             return "testReplaceAll : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;         } catch(Exception e) {             return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();         }     }      private static String testSplit(String regex, String input) {         try {             String[] returned = input.split(regex);             return "testSplit      : Valid   -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);         } catch(Exception e) {             return "testSplit      : Invalid -> "+regex+", "+e.getMessage();         }     }      public static void main(String[] args) {         String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};         String input = "CaaaaaaaaaaaaaaaBaaaa";         int test = 0;         for(String regex : regexes) {             test++;             System.out.println("********************** Test "+test+" **********************");             System.out.println("    "+testFind(regex, input));             System.out.println("    "+testReplaceAll(regex, input));             System.out.println("    "+testSplit(regex, input));             System.out.println();         }     } } 

The output:

********************** Test 1 **********************     testFind       : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true     testReplaceAll : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa     testSplit      : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]  ********************** Test 2 **********************     testFind       : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true     testReplaceAll : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa     testSplit      : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]  ********************** Test 3 **********************     testFind       : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6 (?<=Ca*)B       ^     testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6 (?<=Ca*)B       ^     testSplit      : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6 (?<=Ca*)B       ^  ********************** Test 4 **********************     testFind       : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7 (?<=Ca*?)B        ^     testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7 (?<=Ca*?)B        ^     testSplit      : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7 (?<=Ca*?)B        ^ 

My question may be obvious, but I'll still ask it: Can anyone explain to me why Test 1 and 2 fail, and Test 3 and 4 don't? I would have expected them all to fail, not half of them to work and half of them to fail.

Thanks.

PS. I'm using: Java version 1.6.0_14

like image 206
Bart Kiers Avatar asked Oct 08 '09 10:10

Bart Kiers


1 Answers

Glancing at the source code for Pattern.java reveals that the '*' and '+' are implemented as instances of Curly (which is the object created for curly operators). So,

a* 

is implemented as

a{0,0x7FFFFFFF} 

and

a+ 

is implemented as

a{1,0x7FFFFFFF} 

which is why you see exactly the same behaviors for curlies and stars.

like image 126
Jonathan Feinberg Avatar answered Sep 28 '22 09:09

Jonathan Feinberg