
Java case-insensitive regex matching doesn't work with letter Ñ

Tags: java, regex

Consider this program:

import java.util.regex.Pattern;
public class xx {

    /*
     *  Ñ
     *  LATIN CAPITAL LETTER N WITH TILDE
     *  Unicode: U+00D1, UTF-8: C3 91
     */
    public static final String BIG_N = "\u00d1";

    /*
     *  ñ
     *  LATIN SMALL LETTER N WITH TILDE
     *  Unicode: U+00F1, UTF-8: C3 B1
     */
    public static final String LITTLE_N = "\u00f1";

    public static void main(String[] args) throws Exception {
        System.out.println(BIG_N.equalsIgnoreCase(LITTLE_N));
        System.out.println(Pattern.compile(BIG_N, Pattern.CASE_INSENSITIVE).matcher(LITTLE_N).matches());
    }
}

Since Ñ is the upper-case version of ñ, you would expect it to print:

true
true

but what it actually prints (Java 1.7.0_17-b02) is:

true
false

Why?

Archie asked May 20 '13 22:05




1 Answer

The Javadoc for Pattern.CASE_INSENSITIVE explains why:

"By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag."

http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CASE_INSENSITIVE

And for completeness: you combine the two flags with a bitwise OR (|).

Pattern.compile(BIG_N, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
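
As a minimal sketch of the difference, the following program compares the ASCII-only behavior with the Unicode-aware one. The class name UnicodeCaseDemo and its helper methods are hypothetical names chosen for illustration; only Pattern, its two flag constants, and the embedded (?iu) flag syntax come from java.util.regex.

```java
import java.util.regex.Pattern;

// Illustrative class: the name and the helper methods are assumptions,
// not part of the original question or answer.
class UnicodeCaseDemo {

    static final String BIG_N = "\u00d1";    // Ñ, LATIN CAPITAL LETTER N WITH TILDE
    static final String LITTLE_N = "\u00f1"; // ñ, LATIN SMALL LETTER N WITH TILDE

    // CASE_INSENSITIVE alone folds case for US-ASCII characters only.
    static boolean asciiOnly(String pattern, String input) {
        return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE)
                      .matcher(input)
                      .matches();
    }

    // Adding UNICODE_CASE makes the case folding Unicode-aware.
    static boolean unicodeAware(String pattern, String input) {
        return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                      .matcher(input)
                      .matches();
    }

    // The same two flags written as embedded flags inside the pattern itself.
    static boolean embeddedFlags(String pattern, String input) {
        return Pattern.compile("(?iu)" + pattern)
                      .matcher(input)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println(asciiOnly(BIG_N, LITTLE_N));     // false
        System.out.println(unicodeAware(BIG_N, LITTLE_N));  // true
        System.out.println(embeddedFlags(BIG_N, LITTLE_N)); // true
    }
}
```

The embedded form (?iu) can be handy when the pattern is passed to an API that accepts only a regex string, such as String.matches or String.replaceAll, where there is no place to supply flag constants.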
Brian Roach answered Sep 20 '22 15:09