Consider this program:
import java.util.regex.Pattern;

public class xx {
    /*
     * Ñ
     * LATIN CAPITAL LETTER N WITH TILDE
     * Unicode: U+00D1, UTF-8: C3 91
     */
    public static final String BIG_N = "\u00d1";

    /*
     * ñ
     * LATIN SMALL LETTER N WITH TILDE
     * Unicode: U+00F1, UTF-8: C3 B1
     */
    public static final String LITTLE_N = "\u00f1";

    public static void main(String[] args) throws Exception {
        System.out.println(BIG_N.equalsIgnoreCase(LITTLE_N));
        System.out.println(Pattern.compile(BIG_N, Pattern.CASE_INSENSITIVE).matcher(LITTLE_N).matches());
    }
}
Since Ñ is the upper-case version of ñ, you would expect it to print:
true
true
but what it actually prints (Java 1.7.0_17-b02) is:
true
false
Why?
Java regular expressions are used to find, match, and extract data from character sequences, and they are case-sensitive by default.
By default, literal characters in a pattern are compared case-sensitively against the input, white space in a pattern is treated as literal white space, and capturing groups are numbered implicitly and can also be named explicitly with (?<name>...).
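As a small illustration of those defaults, here is a sketch that compiles patterns with no flags at all. The class name RegexDefaultsDemo and the helper matches are made up for this example and are not part of the question's code.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows Pattern's behaviour when no flags are passed.
class RegexDefaultsDemo {
    // Returns true if the whole input matches the regex, compiled without flags.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Literal characters are compared case-sensitively.
        System.out.println(matches("abc", "ABC"));
        System.out.println(matches("abc", "abc"));
        // A space in the pattern must match a space in the input.
        System.out.println(matches("a b", "a b"));
        System.out.println(matches("a b", "ab"));
    }
}
```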
Backslashes in Java. The backslash \ is an escape character in Java string literals, so it already has a predefined meaning there. To get a single backslash into a regex you have to write a double backslash \\ in the source. For example, if you want \w in your regex, you must write \\w in the Java string.
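A minimal sketch of the escaping rule, assuming nothing beyond java.util.regex. The class name BackslashDemo and its two constants are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: how regex backslashes look inside Java string literals.
class BackslashDemo {
    // The regex \w+ needs each backslash doubled in the Java source.
    static final Pattern WORD = Pattern.compile("\\w+");
    // A regex that matches one literal backslash is \\ and so takes four in the source.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    public static void main(String[] args) {
        // "hello_42" consists only of word characters.
        System.out.println(WORD.matcher("hello_42").matches());
        // The Java literal "a\\b" is the three-character string a, backslash, b.
        System.out.println(BACKSLASH.matcher("a\\b").find());
    }
}
```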
By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CASE_INSENSITIVE
And for completeness: you OR (|) the flags together.
Pattern.compile(BIG_N, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
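Putting it together, here is a sketch of the program from the question with both flag combinations side by side. The class name UnicodeCaseDemo and the helper matches are made up for this example. The last line uses the embedded flag form (?iu), which should be equivalent to passing CASE_INSENSITIVE | UNICODE_CASE.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: ASCII-only versus Unicode-aware case-insensitive matching.
class UnicodeCaseDemo {
    static final String BIG_N = "\u00d1";    // Ñ
    static final String LITTLE_N = "\u00f1"; // ñ

    // Returns true if the whole input matches the pattern under the given flags.
    static boolean matches(String pattern, int flags, String input) {
        return Pattern.compile(pattern, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // CASE_INSENSITIVE alone only folds US-ASCII letters.
        System.out.println(matches(BIG_N, Pattern.CASE_INSENSITIVE, LITTLE_N));
        // Adding UNICODE_CASE makes the case folding Unicode-aware.
        System.out.println(matches(BIG_N, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, LITTLE_N));
        // The same two flags written as embedded flags inside the pattern.
        System.out.println(matches("(?iu)" + BIG_N, 0, LITTLE_N));
    }
}
```

With both flags set, the second and third lines should print true, matching what equalsIgnoreCase reports in the question.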