
Single Chinese character determined as length 2 in Java/Scala String

I'm trying to extract all the Chinese characters from a String, but I ran into a strange situation with the character 𥑮:

scala> "𥑮"
res1: String = 𥑮

scala> res1.length
res2: Int = 2

scala> res1.getBytes
res3: Array[Byte] = Array(-16, -91, -111, -82)

scala> res1(0)
res4: Char = ?

scala> res1(1)
res5: Char = ?

It's a single character, but Java/Scala treats it as two unknown characters. Also, I usually see Chinese characters taking three bytes in UTF-8, but this one takes four.

Hence, I can't split a String and find this single character. Even worse, when using myString.replaceAll("[^\\p{script=Han}]", "") to kick out all the non-Chinese characters, the second part of 𥑮 is replaced and it becomes an invalid String.

Is there any solution to this? I'm using openjdk-8-jdk on Ubuntu.

pishen asked Feb 27 '15 09:02




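1 Answer

String.length() counts UTF-16 code units (chars), not characters. 𥑮 is U+2546E, which lies outside the Basic Multilingual Plane, so it is stored as a surrogate pair of two chars (and as four bytes in UTF-8). To count code points instead, use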

string.codePointCount(0, string.length());

For replacement, one way to avoid depending on how the regex engine treats surrogate pairs is to skip regex altogether. You could write a loop that advances by code point, for example with String#offsetByCodePoints(), and manually drop characters based on String#codePointAt() and Character.isIdeographic().
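A minimal sketch of the loop described above, for illustration only. The class name HanFilter and the helper names codePointLength and keepIdeographs are made up for this example. It also advances with Character.charCount(cp) rather than String#offsetByCodePoints(), which does the same job when stepping one code point at a time. Note that Character.isIdeographic() is not exactly the same test as \p{script=Han}, so treat the filter as an approximation.

```java
class HanFilter {
    /** Number of Unicode code points (not UTF-16 chars) in s. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Keeps only the code points accepted by Character.isIdeographic. */
    static String keepIdeographs(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // reads a full surrogate pair if present
            if (Character.isIdeographic(cp)) {
                out.appendCodePoint(cp);        // writes both chars of a pair back out
            }
            i += Character.charCount(cp);       // 1 for BMP, 2 for supplementary
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\uD855\uDC6E" is the surrogate pair for U+2546E (𥑮); "\u4E2D" is 中.
        String s = "ab\uD855\uDC6E\u4E2D!";
        System.out.println(s.length());            // 6 chars
        System.out.println(codePointLength(s));    // 5 code points
        System.out.println(keepIdeographs(s).equals("\uD855\uDC6E\u4E2D")); // true
    }
}
```

Because the loop never looks at a lone half of a surrogate pair, the result cannot contain a dangling surrogate.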

Marko Topolnik answered Nov 16 '22 01:11