I've identified some unexpected behavior in Java's regex implementation. When using java.util.regex.Pattern and java.util.regex.Matcher, the following regular expression does not match the input "Merlot" when using Matcher's find() method:
((?:White )?Zinfandel|Merlot)
If I change the order of the expressions inside the outermost matching group, Matcher's find() method does match.
(Merlot|(?:White )?Zinfandel)
Here is some test code that illustrates the problem.
    import java.util.regex.*;

    public class RegexTest {
        public static void main(String[] args) {
            Pattern pattern1 = Pattern.compile("((?:White )?Zinfandel|Merlot)");
            Matcher matcher1 = pattern1.matcher("Merlot");
            // prints "No match :("
            if (matcher1.find()) {
                System.out.println(matcher1.group(0));
            } else {
                System.out.println("No match :(");
            }

            Pattern pattern2 = Pattern.compile("(Merlot|(?:White )?Zinfandel)");
            Matcher matcher2 = pattern2.matcher("Merlot");
            // prints "Merlot"
            if (matcher2.find()) {
                System.out.println(matcher2.group(0));
            } else {
                System.out.println("No match :(");
            }
        }
    }
The expected output is:
    Merlot
    Merlot
But the actual output is:
    No match :(
    Merlot
I've verified that this unexpected behavior exists in Java version 1.7.0_11 on Ubuntu Linux and in Java version 1.6.0_37 on OS X 10.8.2. I reported it as a bug to Oracle yesterday and got back an automated email telling me that my bug report has been received and has an internal review ID of 2441589. I can't find my bug report when I search for that ID in their bug database. (Can you hear the crickets?)
Have I uncovered a bug in Java's presumably thoroughly tested and used regex implementation (hard to believe in 2013), or am I doing something wrong?
The backslash \ is an escape character in Java string literals, so it already has a predefined meaning there. To put a single backslash into a string you have to write a double backslash \\. For example, if you want \w in your regex, you must write \\w in the Java source.
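As a small illustration of the escaping rule, here is a minimal sketch. The class name EscapeDemo and the method isWord are made up for this example and do not come from the question.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // Hypothetical helper: true when the whole input is one or more word characters.
    static boolean isWord(String input) {
        // The Java literal "\\w+" is the regex \w+
        return Pattern.compile("\\w+").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWord("Merlot"));          // true
        System.out.println(isWord("White Zinfandel")); // false: the space is not a word character
        // A literal backslash is the regex \\ and therefore the Java literal "\\\\".
        System.out.println(Pattern.compile("\\\\").matcher("a\\b").find()); // true
    }
}
```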
The Regex class itself is thread safe and immutable (read-only). That is, Regex objects can be created on any thread and shared between threads; matching methods can be called from any thread and never alter any global state.
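In Java, the comparable documented guarantee is on java.util.regex.Pattern: compiled Pattern instances are immutable and safe to share between threads, while Matcher instances are not. The sketch below assumes that usage pattern. The class name SharedPatternDemo, the WINE constant and the sample inputs are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class SharedPatternDemo {
    // Compiled once and shared by every thread.
    private static final Pattern WINE = Pattern.compile("Merlot|Zinfandel");

    // Counts the inputs that contain a match, using a parallel stream so
    // that several threads may call WINE.matcher(...) concurrently.
    static long countMatches(List<String> inputs) {
        return inputs.parallelStream()
                // Each call creates its own Matcher; Matchers are never shared.
                .filter(s -> WINE.matcher(s).find())
                .count();
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("Merlot", "Chablis", "White Zinfandel");
        System.out.println(countMatches(inputs)); // 2
    }
}
```

The point of the design is that only the Pattern is shared; every task asks it for a fresh Matcher.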
For large strings, a regex can be faster than hand-written if checks (perhaps inside a for loop) that test whether anything matches your requirement. If you are only matching a very small text against a small pattern, don't use a regex, because Matcher's find() is slower than a plain if statement or a switch statement.
A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions can be used to search, edit, or manipulate text and data. The java.util.regex package primarily consists of three classes: Pattern, Matcher, and PatternSyntaxException.
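A minimal sketch of how Pattern, Matcher and PatternSyntaxException from java.util.regex fit together. The class name RegexClassesDemo and the helper firstMatch are hypothetical, chosen only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class RegexClassesDemo {
    // Hypothetical helper: returns the first match of regex in input,
    // or null if there is no match or the regex does not compile.
    static String firstMatch(String regex, String input) {
        try {
            Pattern pattern = Pattern.compile(regex);  // compiled form of the regex
            Matcher matcher = pattern.matcher(input);  // matching state for one input
            return matcher.find() ? matcher.group() : null;
        } catch (PatternSyntaxException e) {           // thrown for an invalid regex
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("Zin\\w+", "White Zinfandel")); // Zinfandel
        System.out.println(firstMatch("(unclosed", "anything"));      // null
    }
}
```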
The following:
    import java.util.regex.*;

    public class T {
        public static void main( String args[] ) {
            System.out.println( Pattern.compile("(a)?bb|c").matcher("c").find() );
            System.out.println( Pattern.compile("(a)?b|c").matcher("c").find() );
        }
    }
prints
    false
    true
on:
The following:
    import java.util.regex.*;

    public class T {
        public static void main( String args[] ) {
            System.out.println( Pattern.compile("((a)?bb)|c").matcher("c").find() );
            System.out.println( Pattern.compile("((a)?b)|c").matcher("c").find() );
        }
    }
prints:
    true
    true
It seems to be fixed in Java 1.8.
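To check whether a particular JVM shows the behavior, something like the following sketch could be run on it. It reuses the patterns quoted in this thread; the class name AlternationBugCheck and the helper finds are made up for this example, and the result depends on the Java version used.

```java
import java.util.regex.Pattern;

final class AlternationBugCheck {
    // Hypothetical helper: true if regex is found anywhere in input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Report which runtime produced the results below.
        System.out.println("java.version = " + System.getProperty("java.version"));

        // Each line should print true on a JVM that does not have the problem.
        System.out.println(finds("(a)?bb|c", "c"));
        System.out.println(finds("((?:White )?Zinfandel|Merlot)", "Merlot"));
        System.out.println(finds("(Merlot|(?:White )?Zinfandel)", "Merlot"));
    }
}
```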
    Welcome to Scala version 2.11.0-20130930-063927-2bba779702 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0-ea).
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala> import java.util.regex._
    import java.util.regex._

    scala> Pattern.compile("((?:White )?Zinfandel|Merlot)")
    res0: java.util.regex.Pattern = ((?:White )?Zinfandel|Merlot)

    scala> .matcher("Merlot")
    res1: java.util.regex.Matcher = java.util.regex.Matcher[pattern=((?:White )?Zinfandel|Merlot) region=0,6 lastmatch=]

    scala> .find()
    res2: Boolean = true