The string in question contains a supplementary Unicode character, "\ud84c\udfb4". According to the Javadoc, regex matching should be done at the code point level, not the char level. However, the split code below treats the low surrogate (\udfb4) as a non-word character and splits on it.
Am I missing something? What alternatives are there for splitting on non-word characters? (Java version "1.7.0_07")
Thanks in advance.
    Pattern non_word_regex = Pattern.compile("[\\W]", Pattern.UNICODE_CHARACTER_CLASS);
    String a = "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";
    String b = "功能 絶𣎴顯示廣告";
    System.out.print("original " + a + "\norginal hex ");
    for (char c : a.toCharArray()) {
        System.out.print(Integer.toHexString((int) c));
        System.out.print(' ');
    }
    System.out.println();
    String[] tokens = non_word_regex.split(a);
    for (int i = 0; i < tokens.length; i++) {
        String token = tokens[i];
        System.out.print(i + " ");
        for (char c : token.toCharArray()) {
            System.out.print(Integer.toHexString((int) c));
            System.out.print(' ');
        }
        System.out.println();
    }
Output:
original 功能 絶𣎴顯示廣告
orginal hex 529f 80fd 20 7d76 d84c dfb4 986f 793a 5ee3 544a
0 529f 80fd
1 7d76 d84c
2 986f 793a 5ee3 544a
The backslash \ is an escape character in Java string literals, so a single backslash in a regex has to be written as a double backslash \\. To use \w in a pattern, write \\w in the Java source.
Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP).
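To make the char versus code point distinction concrete, here is a minimal sketch using the character from the question. The class name CodePointDemo and its helper methods are made up for illustration; the String and Character calls are standard JDK methods.

```java
public class CodePointDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    public static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    public static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "\ud84c\udfb4"; // the supplementary character from the question
        int cp = s.codePointAt(0);
        System.out.println("chars: " + charCount(s));
        System.out.println("code points: " + codePointCount(s));
        System.out.println("code point: U+" + Integer.toHexString(cp).toUpperCase());
        System.out.println("supplementary: " + Character.isSupplementaryCodePoint(cp));
    }
}
```

The surrogate pair d84c dfb4 is two chars but one code point, U+233B4, which is why a char-by-char scan can land in the middle of it.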
This looks like a bug in the regex engine. If you use the \w expression, everything matches correctly: 𣎴 remains a single code point made up of two chars. This can be verified by running the following code:
    Pattern pattern = Pattern.compile("(?U)[\\w]");
    String str = "功能 絶𣎴顯示廣告";
    Matcher matcher = pattern.matcher(str);
    while (matcher.find()) {
        System.out.println(matcher.toMatchResult().group());
    }
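Since \w matches correctly, one possible alternative to splitting on \W is to collect runs of word characters instead. The following is only a sketch of that idea; the class WordTokenizer and its tokenize method are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordTokenizer {
    // One or more word characters, using Unicode-aware character classes.
    private static final Pattern WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns each maximal run of word characters as a token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same string as in the question, written with escapes.
        String a = "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";
        for (String token : tokenize(a)) {
            System.out.println(token);
        }
    }
}
```

Unlike split, this drops empty tokens between consecutive separators, which may or may not be what you want.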
I've just done a thorough investigation, so I can tell you where the problem is. If you look at the compile() method in java.util.regex.Pattern (starting at line 1625), you will see code that scans the regex for supplementary characters and decides whether to support them during scanning.
The problem with this approach is that the code doesn't take into account that a regex without supplementary characters may still need to match them, as in your case.
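To illustrate the kind of check being described, here is a rough sketch. It is not the JDK source; SupplementaryScan and containsSupplementary are hypothetical names, and the code only mimics the idea of scanning the pattern text itself for supplementary code points.

```java
public class SupplementaryScan {
    // Illustrative only: returns true if the pattern text contains
    // at least one supplementary code point (above U+FFFF).
    public static boolean containsSupplementary(String regex) {
        for (int i = 0; i < regex.length(); ) {
            int cp = regex.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsSupplementary("[\\W]"));
        System.out.println(containsSupplementary("(?!\uDB80\uDC00)[\\W]"));
    }
}
```

A check like this looks only at the pattern, so a pattern such as [\W] contains nothing that would flag it, even though the input text may contain supplementary characters.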
The solution is to devise a regex that contains supplementary characters which don't affect the matching process. I suggest something innocuous like this:
    Pattern nonWordRegex = Pattern.compile("(?U)(?!\uDB80\uDC00)[\\W]");
The part (?!\uDB80\uDC00) does the trick. This is a negative lookahead for a character in the private-use range of supplementary characters, so you most likely won't find it in the text. And voilà: the regex engine thinks there are supplementary characters in the pattern and turns on support for them!
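For completeness, here is a small sketch that applies this pattern to the string from the question and prints each token as code points rather than chars. The class SplitWorkaround and its split helper are names invented for this example.

```java
import java.util.regex.Pattern;

public class SplitWorkaround {
    // The workaround pattern: the lookahead mentions U+F0000 (a private-use
    // supplementary character) so that the pattern itself contains one.
    private static final Pattern NON_WORD =
            Pattern.compile("(?U)(?!\uDB80\uDC00)[\\W]");

    public static String[] split(String text) {
        return NON_WORD.split(text);
    }

    public static void main(String[] args) {
        String a = "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";
        String[] tokens = split(a);
        for (int i = 0; i < tokens.length; i++) {
            System.out.print(i + " ");
            // Walk the token by code point so surrogate pairs stay together.
            for (int j = 0; j < tokens[i].length(); ) {
                int cp = tokens[i].codePointAt(j);
                System.out.print(Integer.toHexString(cp) + " ");
                j += Character.charCount(cp);
            }
            System.out.println();
        }
    }
}
```

With the workaround in place, the string should split only at the space, leaving U+233B4 intact inside the second token.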