I'm trying to split out all the Chinese characters from a String, but I bumped into a strange situation for the character 𥑮
scala> "𥑮"
res1: String = 𥑮
scala> res1.length
res2: Int = 2
scala> res1.getBytes
res3: Array[Byte] = Array(-16, -91, -111, -82)
scala> res1(0)
res4: Char = ?
scala> res1(1)
res5: Char = ?
It's a single character, but Java/Scala treats it as two unknown characters. Also, Chinese characters usually take three bytes in UTF-8, but this one takes four.
Hence, I can't split a String and find this single character. Even worse, when using myString.replaceAll("[^\\p{script=Han}]", "")
to strip all the non-Chinese characters, the second half of 𥑮 is removed and the result is an invalid String.
Is there any solution to this? I'm using openjdk-8-jdk on Ubuntu.
Unicode/UTF-8 characters include: Chinese characters; any non-Latin scripts (Hebrew, Cyrillic, Japanese, etc.); symbols.
The setLength(int newLength) method of StringBuilder sets the length of the character sequence to newLength. For every nonnegative index k less than newLength, the character at index k is unchanged if k is less than the old length, and is the null character '\u0000' otherwise. If newLength is less than the old length, the sequence is truncated to newLength.
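In some legacy CJK encodings (such as CCCII), each Chinese character is represented by a 3-byte code in which each byte is 7-bit, between 0x21 and 0x7E inclusive. UTF-8 is different: Han characters in the Basic Multilingual Plane take three bytes, while supplementary characters such as 𥑮 take four.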
The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.
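The REPL output in the question follows directly from this: 𥑮 is U+2546E, which lies outside the Basic Multilingual Plane, so UTF-16 stores it as a surrogate pair of two chars. A minimal sketch to illustrate this, where the class name SurrogateDemo and the constant HAN are made up for the example:

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // U+2546E written as its UTF-16 surrogate pair so the source stays ASCII-only.
    static final String HAN = "\uD855\uDC6E";

    public static void main(String[] args) {
        // length() counts UTF-16 code units (chars), not characters.
        System.out.println(HAN.length());                              // 2
        // Each half on its own is a surrogate, which is why it prints as '?'.
        System.out.println(Character.isHighSurrogate(HAN.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(HAN.charAt(1)));   // true
        // Reading at the code point level recovers the single character.
        System.out.printf("U+%X%n", HAN.codePointAt(0));               // U+2546E
        // Supplementary code points need four bytes in UTF-8.
        System.out.println(HAN.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```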
For the length in characters (code points) rather than chars, you should use
string.codePointCount(0, string.length());
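The same idea extends to splitting a String into individual characters. The following is only a sketch, assuming Java 8; the class CodePointSplit and its method names are hypothetical helpers, not part of any library:

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSplit {
    // Number of Unicode code points, as opposed to UTF-16 chars.
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    // One String per code point; a surrogate pair stays together.
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD855\uDC6Eb";          // 'a', U+2546E, 'b'
        System.out.println(s.length());        // 4
        System.out.println(length(s));         // 3
        System.out.println(split(s).size());   // 3
    }
}
```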
For replacement, it is safer to avoid the regex, which in this case ends up operating on individual chars and splits the surrogate pair. You could write a loop that advances with String#offsetByCodePoints() and manually removes characters based on String#codePointAt() and Character.isIdeographic().
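A loop of that kind might look like the sketch below. The class name HanFilter and the method keepHan are invented for illustration, and the code assumes Java 7 or later (Character.isIdeographic was added in Java 7):

```java
class HanFilter {
    // Keeps only ideographic code points; everything else is dropped.
    static String keepHan(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);      // whole code point, even if it spans two chars
            if (Character.isIdeographic(cp)) {
                sb.appendCodePoint(cp);     // writes both surrogates when needed
            }
            i = s.offsetByCodePoints(i, 1); // advance by one code point, not one char
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'abc', U+4E2D, U+2546E, '!'
        String cleaned = keepHan("abc\u4E2D\uD855\uDC6E!");
        System.out.println(cleaned.codePointCount(0, cleaned.length())); // 2
    }
}
```

Character.isIdeographic() tests the Unicode Ideographic property, which is close to but not identical with the Han script. If the filter should mirror \p{script=Han} exactly, the condition could instead be Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN.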