
Cannot match string using regex

Tags: java, regex

I am working on some regex and I wonder why this regex

"(?<=(.*?id(( *)=)\\s[\"\']))g"

does not match the string

<input id = "g" />

in Java?

Betamoo asked Oct 06 '10 20:10



2 Answers

java.util.regex does not support unbounded lookbehind, as described by RegexBuddy:

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

To add a little clarification from the documentation:

Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

Some regex flavors, like PCRE and Java support the above, plus alternation with strings of different lengths. Each part of the alternation must still have a finite maximum length. This means you can still not use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. These regex flavors recognize the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6.
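To make the quoted rule concrete for the pattern in the question, here is a minimal sketch. It assumes it is acceptable to cap the whitespace around `=` at five characters; that cap, the class name, and the method name are illustrative choices, not something from the original posts. With every quantifier inside the lookbehind bounded, java.util.regex can work out how far back it may need to step.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehindDemo {
    // Every quantifier inside the lookbehind has an explicit upper bound.
    // {0,5} is an arbitrary cap chosen for illustration (an assumption).
    static final Pattern BOUNDED =
        Pattern.compile("(?<=id\\s{0,5}=\\s{0,5}[\"'])g");

    // Returns the index where the pattern matches, or -1 if it does not.
    static int findValueStart(String source) {
        Matcher m = BOUNDED.matcher(source);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Prints 13, the index of the g inside the quotes.
        System.out.println(findValueStart("<input id = \"g\" />"));
    }
}
```

The trade-off is that input with more whitespace than the cap allows will simply not match, so the bound has to be chosen with the expected input in mind.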

Mike answered Oct 04 '22 14:10



Not only does Java not allow unbounded lookbehind, it's supposed to throw an exception if you try. The fact that you're not seeing that exception is itself a bug.
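Whether the exception actually appears can depend on the pattern and on the JDK in use, so it may be worth checking directly. The following is a small sketch (the class and method names are hypothetical) that tries to compile a pattern and reports what the running JVM did with it, without assuming either outcome for the pattern from the question.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class UnboundedLookbehindCheck {
    // Tries to compile a pattern and reports what the running JVM did.
    static String tryCompile(String regex) {
        try {
            Pattern.compile(regex);
            return "compiled";
        } catch (PatternSyntaxException e) {
            // getDescription() gives the engine's explanation of the error.
            return "rejected: " + e.getDescription();
        }
    }

    public static void main(String[] args) {
        // The pattern from the question, with unbounded .*? and " *"
        // inside the lookbehind. The result is deliberately not assumed.
        System.out.println(tryCompile("(?<=(.*?id(( *)=)\\s[\"\']))g"));
    }
}
```

If this prints "compiled" and the pattern still fails to match, that is consistent with the behavior described above: the engine accepted a lookbehind it cannot evaluate reliably.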

You shouldn't be using lookbehind for that anyway. If you want to match the value of a certain attribute, the easiest, least troublesome approach is to match the whole attribute and use a capturing group to extract the value. For example:

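import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Match the whole attribute; group 1 captures the value between the quotes.
String source = "<input id = \"g\" />";
Pattern p = Pattern.compile("\\bid\\s*=\\s*\"([^\"]*)\"");
Matcher m = p.matcher(source);
if (m.find())
{
  System.out.printf("Found 'id' attribute '%s' at position %d%n",
                    m.group(1), m.start());
}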

Output:

Found 'id' attribute 'g' at position 7
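The pattern in the question also accepted single quotes. One possible extension of the capturing-group approach, offered as a sketch and not as part of the original answer, is to capture the opening quote and require the same character to close the value with a backreference. The class and method names here are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdAttributeDemo {
    // Group 1 is the opening quote, group 2 the value; \1 requires the
    // closing quote to be the same character as the opening one.
    static final Pattern ID_ATTR =
        Pattern.compile("\\bid\\s*=\\s*([\"'])(.*?)\\1");

    // Returns the id value, or null when no quoted id attribute is found.
    static String idValue(String source) {
        Matcher m = ID_ATTR.matcher(source);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(idValue("<input id = \"g\" />")); // prints g
        System.out.println(idValue("<input id = 'g' />"));   // prints g
    }
}
```

Because the quote is captured and not consumed by a lookbehind, no length restriction applies and `\s*` can stay unbounded.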

Do yourself a favor and forget about lookbehinds for a while. They're tricky even when they're not buggy, and they're really not as useful as you might expect.

Alan Moore answered Oct 04 '22 14:10
