Single Chinese character determined as length 2 in Java/Scala String

Tags: java, character-encoding, scala

I'm trying to extract all the Chinese characters from a String, but I ran into a strange situation with the character 𥑮:

scala> "𥑮"
res1: String = 𥑮

scala> res1.length
res2: Int = 2

scala> res1.getBytes
res3: Array[Byte] = Array(-16, -91, -111, -82)

scala> res1(0)
res4: Char = ?

scala> res1(1)
res5: Char = ?

It's a single character, but Java/Scala treats it as two unknown characters. Also, the Chinese characters I usually see take three bytes in UTF-8, but this one takes four.

Hence, I can't split a String and find this single character. Even worse, when I use myString.replaceAll("[^\\p{script=Han}]", "") to strip all the non-Chinese characters, the second part of 𥑮 is removed and the result is an invalid String.

Is there any solution to this? I'm using openjdk-8-jdk on Ubuntu.


asked Feb 27 '15 09:02

pishen

1 Answer

To get the length in code points rather than in chars, you should use

string.codePointCount(0, string.length());

For replacement it is best to avoid regex, which is char-based. You could write a loop relying on String#offsetByCodePoints() and manually remove characters based on String.codePointAt() and Character.isIdeographic().
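A minimal sketch of the loop described above, assuming Java 7 or later. The class name CodePointDemo and the helper names codePointLength and keepIdeographs are made up for illustration. Note that Character.isIdeographic() is not exactly the same set as \p{script=Han}; if you need the script test, comparing Character.UnicodeScript.of(cp) with Character.UnicodeScript.HAN is probably the closer match.

```java
final class CodePointDemo {

    // Length in Unicode code points, not in UTF-16 chars.
    public static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: keeps only the ideographic code points of s.
    // Walks the string one code point at a time, so a surrogate pair
    // is never split in the middle.
    public static String keepIdeographs(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (Character.isIdeographic(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one code point (one or two chars).
            i = s.offsetByCodePoints(i, 1);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+2546E is the character from the question; building it from
        // the code point avoids depending on the source file encoding.
        String han = new String(Character.toChars(0x2546E));

        System.out.println(han.length());          // 2 (a surrogate pair)
        System.out.println(codePointLength(han));  // 1

        String mixed = "a" + han + "b\u4E2D!";
        System.out.println(codePointLength(keepIdeographs(mixed))); // 2
    }
}
```

The character is stored as two chars because it lies outside the Basic Multilingual Plane, which is also why it needs four bytes in UTF-8 instead of three.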


answered Nov 16 '22 01:11

Marko Topolnik
