
Java 7, regexes and supplementary unicode characters

The string in question contains a supplementary Unicode character, "\ud84c\udfb4" (U+233B4). According to the Javadoc, regex matching should be done at the code point level, not the char level. However, the split code below treats the low surrogate (\udfb4) as a non-word character and splits on it.

Am I missing something? What are other alternatives to accomplish splitting on non-word characters? (Java version "1.7.0_07")

Thanks in advance.

// Requires: import java.util.regex.Pattern;
Pattern non_word_regex = Pattern.compile("[\\W]", Pattern.UNICODE_CHARACTER_CLASS);
String a = "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";
String b = "功能 絶𣎴顯示廣告"; // the same text written literally (not used below)
System.out.print("original " + a + "\noriginal hex ");
// Dump every UTF-16 char (code unit) of the input in hex
for (char c : a.toCharArray()) {
    System.out.print(Integer.toHexString((int) c));
    System.out.print(' ');
}
System.out.println();

String[] tokens = non_word_regex.split(a);

// Dump every token, again char by char, to see where the split happened
for (int i = 0; i < tokens.length; i++) {
    String token = tokens[i];
    System.out.print(i + " ");
    for (char c : token.toCharArray()) {
        System.out.print(Integer.toHexString((int) c));
        System.out.print(' ');
    }
    System.out.println();
}

Output:
original 功能 絶𣎴顯示廣告
original hex 529f 80fd 20 7d76 d84c dfb4 986f 793a 5ee3 544a
0 529f 80fd
1 7d76 d84c
2 986f 793a 5ee3 544a
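
For comparison, the same string can be dumped per code point instead of per char. This is only a small sketch to show that d84c dfb4 is a single code point (U+233B4); the class name CodePointDump and the helper codePointsHex are made up for this illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {
    // Returns the code points of s as lowercase hex strings, one entry per code point.
    static List<String> codePointsHex(String s) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // reads a full surrogate pair if one starts at i
            out.add(Integer.toHexString(cp));
            i += Character.charCount(cp);    // advance by 1 or 2 chars
        }
        return out;
    }

    public static void main(String[] args) {
        String a = "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";
        System.out.println(a.length());                       // number of chars (UTF-16 code units)
        System.out.println(a.codePointCount(0, a.length()));  // number of code points
        System.out.println(codePointsHex(a));
    }
}
```

The string has 10 chars but only 9 code points, so a split that works at the code point level should never cut between d84c and dfb4.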

asked Dec 10 '13 18:12 by user3088039



1 Answer

This simply looks like a bug in the regex engine. If you use the \w expression, everything matches correctly: 𣎴 remains a single code point composed of two chars. This can be easily verified by running the following code:

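// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("(?U)[\\w]");
String str = "功能 絶𣎴顯示廣告";

// Print each match on its own line; 𣎴 should come out as one match of two chars
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group());
}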

I've just made a thorough investigation, so I can tell you where the problem is. If you look at the compile() method in java.util.regex.Pattern (starting at line 1625), you will see code that scans the regex for supplementary characters and decides whether to support them during matching.

The problem with this approach is that a regex with no supplementary characters of its own may still need to match them in the input, which is exactly what happens in your case.
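
To make the idea of that scan concrete, here is a minimal sketch. It is not the JDK source: containsSupplementary is a hypothetical helper that only mimics the described check, namely looking at the pattern text itself for supplementary code points.

```java
class SupplementaryScan {
    // Hypothetical helper (NOT the actual JDK code): returns true if the
    // pattern text itself contains at least one supplementary code point.
    static boolean containsSupplementary(String regex) {
        for (int i = 0; i < regex.length(); ) {
            int cp = regex.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // Pure ASCII pattern text, even though \W can match supplementary input
        System.out.println(containsSupplementary("(?U)[\\W]"));
        // Pattern text that carries a supplementary character (U+F0000)
        System.out.println(containsSupplementary("(?U)(?!\uDB80\uDC00)[\\W]"));
    }
}
```

A check of this kind says nothing about what the pattern can match, only about what it is written with, which is why a class like [\W] slips through.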

A workaround is to write a regex that does contain a supplementary character, but in a way that doesn't affect the matching process. I suggest you use something innocent like this:

Pattern nonWordRegex = Pattern.compile("(?U)(?!\uDB80\uDC00)[\\W]");

The part (?!\uDB80\uDC00) does the trick. It is a negative lookahead for a character (U+F0000) in the private-use range of supplementary characters, which means that you most likely won't find it in the text. And voilà: the regex engine thinks that there are supplementary characters in the pattern and turns on support for them!
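
Here is a small self-contained sketch that applies this pattern to the string from the question. The class name SplitWorkaround and the helper splitOnNonWord are made up for the example; only the pattern itself comes from this answer.

```java
import java.util.regex.Pattern;

class SplitWorkaround {
    // Splits on non-word characters using the lookahead workaround.
    // The lookahead for U+F0000 is only there to put a supplementary
    // character into the pattern text.
    static String[] splitOnNonWord(String input) {
        Pattern nonWordRegex = Pattern.compile("(?U)(?!\uDB80\uDC00)[\\W]");
        return nonWordRegex.split(input);
    }

    public static void main(String[] args) {
        String a = "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";
        String[] tokens = splitOnNonWord(a);
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            System.out.print(i + " ");
            // Print each token per code point, so a surrogate pair shows up as one value
            for (int j = 0; j < token.length(); ) {
                int cp = token.codePointAt(j);
                System.out.print(Integer.toHexString(cp) + " ");
                j += Character.charCount(cp);
            }
            System.out.println();
        }
    }
}
```

With the workaround in place, the split should happen only at the space, leaving 233b4 inside the second token.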

answered Sep 27 '22 19:09 by Malcolm