
Regex \p{Cs} not matching symbol in Java 16

Does anyone know why the regex \p{Cs} does not match the symbol 񡠼 in Java 16? It used to match it in Java 11.

Java 11

jshell 
|  Welcome to JShell -- Version 11.0.7
|  For an introduction type: /help intro

jshell> import java.util.regex.*

jshell> var text = new StringBuilder().appendCodePoint(55622).appendCodePoint(56380)
text ==> 񡠼

jshell> Pattern.compile("\\p{Cs}").matcher(text).find()
$3 ==> true

Java 16

INFO: Created user preferences directory.
|  Welcome to JShell -- Version 16.0.1
|  For an introduction type: /help intro

jshell> import java.util.regex.*

jshell> var text = new StringBuilder().appendCodePoint(55622).appendCodePoint(56380)
text ==> 񡠼

jshell> Pattern.compile("\\p{Cs}").matcher(text).find()
$3 ==> false

Asked Mar 01 '23 12:03 by Sandro Batista Santos


1 Answer

First, your “symbol 񡠼” has the codepoint 399420 (U+6183C), which is not assigned by the Unicode standard (yet), so if you are seeing something useful here, it’s non-standard behavior of your system.

The way you construct the string is not semantically correct, but it happens to create the intended string. For historical reasons, Java’s API is centered around a UTF-16 representation.

When you define the symbol using two surrogate characters, i.e.

var text = "\uD946\uDC3C";
System.out.println(text.codePointAt(0));

you’ll get

399420

On the other hand, when you use

var text = new StringBuilder().appendCodePoint(399420);
text.chars().forEach(c -> System.out.printf("\\u%04X", c));
System.out.println();

you’ll get

\uD946\uDC3C

In other words, the sequence of the two surrogate UTF-16 char units \uD946, \uDC3C is equivalent to the single codepoint 399420. Conceptually, the string consists of that single codepoint, so

System.out.println(text.codePointCount(0, text.length()) + " codepoint(s)");
System.out.println(text.codePointAt(0));
System.out.println("type " + Character.getType(text.codePointAt(0)));

will print

1 codepoint(s)
399420
type 0

in either case. The type 0 indicates that this codepoint is unassigned.
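
The arithmetic behind that equivalence can be sketched as follows. This is only an illustration: the class name SurrogateMath and the helper decode are made up for this example, while Character.toCodePoint and Character.isSurrogatePair are the JDK methods that do the same job.

```java
class SurrogateMath {
    // Decode a UTF-16 surrogate pair by hand: each surrogate carries 10 bits,
    // and the combined 20-bit value is offset by 0x10000.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = '\uD946';
        char low = '\uDC3C';
        System.out.println(decode(high, low));                    // 399420
        System.out.println(Character.toCodePoint(high, low));     // 399420
        System.out.println(Character.isSurrogatePair(high, low)); // true
    }
}
```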

You are using appendCodePoint for appending two UTF-16 units to the StringBuilder, but since this method treats codepoints of the BMP the same way as UTF-16 units, it happens to construct the same string, too.
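
A small sketch of that coincidence, assuming a hypothetical class AppendDemo with two helper methods that are not part of the question: one builds the string from the two surrogate values, the other from the single supplementary codepoint.

```java
class AppendDemo {
    // The question's approach: one appendCodePoint call per UTF-16 unit.
    // Both values lie in the BMP range, so each call appends a single char.
    static String fromUnits() {
        return new StringBuilder()
                .appendCodePoint(0xD946)
                .appendCodePoint(0xDC3C)
                .toString();
    }

    // The semantically intended approach: append the one supplementary codepoint.
    static String fromCodePoint() {
        return new StringBuilder().appendCodePoint(399420).toString();
    }

    public static void main(String[] args) {
        System.out.println(fromUnits().equals(fromCodePoint())); // true
        System.out.println(fromUnits().length());                // 2
    }
}
```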

Since the category of the codepoint is “unassigned”, it can’t be “surrogate”, so \p{Cs} should never find a match here. When processing a valid Unicode string, you should never encounter this category, as it can only match dangling surrogate characters, i.e. those which cannot be interpreted as part of a codepoint outside the BMP.
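
For contrast, here is a minimal sketch of a string where \p{Cs} is expected to match: a high surrogate followed by an ordinary letter, which cannot form a pair. The class name DanglingSurrogateDemo and the helper hasSurrogateMatch are invented for this example.

```java
import java.util.regex.Pattern;

class DanglingSurrogateDemo {
    static final Pattern SURROGATE = Pattern.compile("\\p{Cs}");

    // True if the regex engine finds a character of category Cs in the string.
    static boolean hasSurrogateMatch(String s) {
        return SURROGATE.matcher(s).find();
    }

    public static void main(String[] args) {
        // A high surrogate followed by 'x' stays a dangling surrogate.
        String dangling = "\uD946x";
        System.out.println(Character.getType(dangling.codePointAt(0)) == Character.SURROGATE); // true
        System.out.println(hasSurrogateMatch(dangling));      // true
        System.out.println(hasSurrogateMatch("plain ASCII")); // false
    }
}
```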

But there’s the bug JDK-8247546, “Pattern matching does not skip correctly over supplementary characters”. Before Java 16, the regex engine processed the codepoint at location zero correctly but then advanced by only one char position, so it found a dangling surrogate character when looking at char position 1 alone.

We can verify it using

var m = Pattern.compile("\\p{Cs}").matcher(text);
if(m.find()) {
    System.out.println("found a match at " + m.start());
}

which prints “found a match at 1” prior to JDK 16, which is wrong, as position 1 should be skipped when there’s a single codepoint at char positions 0 and 1.

This bug has been fixed in JDK 16. So now, the string is treated as a single codepoint of the “unassigned” category. Of course, this category might change again in the future. But it should never be “surrogate”.
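
If the actual goal is to detect dangling surrogates regardless of the JDK version, one option that avoids the regex engine altogether is to iterate over codepoints. This is just a sketch and not something the question asked for; the class SurrogateScan and the method hasDanglingSurrogate are hypothetical names. It relies on CharSequence.codePoints() combining valid pairs and passing unpaired surrogates through unchanged.

```java
class SurrogateScan {
    // True if the text contains a surrogate that is not part of a valid pair.
    static boolean hasDanglingSurrogate(CharSequence text) {
        return text.codePoints()
                   .anyMatch(cp -> Character.getType(cp) == Character.SURROGATE);
    }

    public static void main(String[] args) {
        // Valid pair: seen as one unassigned codepoint, no surrogate reported.
        System.out.println(hasDanglingSurrogate("\uD946\uDC3C")); // false
        // Reversed order: neither unit can pair up, so both are dangling.
        System.out.println(hasDanglingSurrogate("\uDC3C\uD946")); // true
    }
}
```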

Answered Mar 12 '23 05:03 by Holger