What component handles a Combining Diaeresis in a string?

I am working with a list of file names in Java.

I observe that some single characters in the file names, like ä, ö and ü, actually consist of a sequence of two separate characters, one following the other:

ö is represented by o, ¨

I see this by inspection with codePointAt(). The German name "Rölli" is in fact "Ro¨lli":

...
20: R, 82
21: o, 111
22: ̈, 776
23: l, 108
24: l, 108
25: i, 105
...

The character ¨ in the log above has the value 776 (U+0308), which is a "Combining Diaeresis". This is a so-called combining mark, or more precisely a combining diacritic. So it all makes sense, but I do not understand which software component combines the two characters into one umlaut, and where this behavior is specified.
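The inspection described above can be reproduced with a small sketch like the following. The class name `CodePointDump`, the helper `dump` and the sample string are made up for illustration; the sample is assumed to be stored in decomposed form (o followed by U+0308), as in the file names in question.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointDump {

    // Hypothetical helper: one "index: character, code point" line per code point.
    static List<String> dump(String s) {
        List<String> lines = new ArrayList<>();
        // Step by code point, not by char, so surrogate pairs are not split.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            lines.add(i + ": " + new String(Character.toChars(cp)) + ", " + cp);
            i += Character.charCount(cp);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed sample: "Rölli" written as o + U+0308 (decimal 776).
        String decomposed = "Ro\u0308lli";
        dump(decomposed).forEach(System.out::println);
    }
}
```

For this sample the dump has six entries, and the third one reports 776 for the combining mark.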

It has nothing to do with the fact that large character sets use several bytes per character in their internal representation. Several bytes are not the same as two combining characters.
Any simple print() of the string shows me the combined character, so it does not seem to be some UI layer on top either.
I remember having observed this with PHP as well. I guess any modern language can handle this.

What component causes combining characters to be displayed as single combined characters? How reliable is all this?

Does Java have a normalization method that turns such combining sequences into single code points, like here? That would help when using regex...

Thanks a lot for any hint.

asked Nov 04 '15 10:11

peter_the_oak

1 Answer

Answer 1: Specification and responsibility

The behavior you describe is defined in Unicode Standard Annex #15, Unicode Normalization Forms. It covers the equivalence of combining sequences and single precomposed code points, as well as the decomposition of code points. Many languages other than German rely heavily on composed graphemes.

Java internally represents strings as UTF-16. So all its String class does is deliver sequences of UTF-16 code units to other components. It is up to the surrounding software (e.g. a console or any kind of text view component) to render a base character and its combining mark as one glyph. You notice this when, for example, a regex breaks your combined ö apart, yet the same string is displayed correctly in some view.
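To illustrate how a regex can break a combining sequence apart, here is a minimal sketch. The class name, the helper `matchesPattern` and the pattern `R.lli` are chosen only for this example; the point is that `.` consumes one code point, so the decomposed form no longer fits.

```java
import java.text.Normalizer;

final class RegexOnCombiningMarks {

    // Hypothetical helper: "R", then exactly one character, then "lli".
    static boolean matchesPattern(String name) {
        return name.matches("R.lli");
    }

    public static void main(String[] args) {
        String decomposed = "Ro\u0308lli"; // o + U+0308, six code points
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(matchesPattern(decomposed));  // false: '.' only consumes the o
        System.out.println(matchesPattern(composed));    // true: ö is one code point now
        System.out.println(decomposed.equals(composed)); // false: different code unit sequences
    }
}
```

Both strings look identical when displayed, but String.equals() and the regex engine compare code units, not rendered glyphs.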

By the way, if you do some experiments with the Combining Diaeresis, be aware that there is also a "non-functional" code 168 (U+00A8), a Latin-1 character called "Diaeresis" (formerly "Spacing Diaeresis"). It is not part of ASCII, and it does not cause any software to combine two code points into one. For this you need the Unicode code point 776 (U+0308).
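The difference between the two diaeresis code points can be checked with a short experiment. This is only a sketch; the class name `DiaeresisVariants` and the helper `nfc` are hypothetical, and the normalization form NFC is an assumption of the example.

```java
import java.text.Normalizer;

final class DiaeresisVariants {

    // Hypothetical helper: canonical composition (NFC) of the input.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String withCombining = "o\u0308"; // o + code 776 (combining diaeresis)
        String withSpacing = "o\u00A8";   // o + code 168 (spacing diaeresis)

        System.out.println(nfc(withCombining).length()); // 1: composed to ö
        System.out.println(nfc(withSpacing).length());   // 2: left as two characters

        // The general categories differ as well.
        System.out.println(Character.getType('\u0308') == Character.NON_SPACING_MARK); // true
        System.out.println(Character.getType('\u00A8') == Character.MODIFIER_SYMBOL);  // true
    }
}
```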

Answer 2: Java's normalization method

Basically, you should always take combining characters into account, unless you are sure that your data source cannot deliver them. It's a good idea to sanitize your strings first.

Look for Unicode normalization methods in your language. They save you from fiddling with individual replace() statements, and a lot of experience has gone into them.

Java has a Normalizer class (java.text.Normalizer) that deals with the different representations of combined characters:

https://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html

and the tutorial for it: https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html

So after invoking this line of code:

// needs: import java.text.Normalizer;
String normalized = Normalizer.normalize(someFileName, Normalizer.Form.NFC);

the log print from the question above looks like this:

...
19:  , 32
20: R, 82
21: ö, 246   <<< here were two combined chars before normalize()
22: l, 108
23: l, 108
24: i, 105
...
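Wrapped into a small helper, the normalization step might look like the sketch below. The class name `FileNameNormalizer`, the method `toNfc` and the sample file name are assumptions for illustration only; `Normalizer.isNormalized` is used to skip strings that are already in NFC.

```java
import java.text.Normalizer;

final class FileNameNormalizer {

    // Hypothetical helper: returns the NFC form, leaving already normalized input untouched.
    static String toNfc(String name) {
        if (Normalizer.isNormalized(name, Normalizer.Form.NFC)) {
            return name;
        }
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Assumed sample: a file name stored in decomposed form.
        String someFileName = "Ro\u0308lli.txt";
        String normalized = toNfc(someFileName);

        // Count code points before and after normalization.
        System.out.println(someFileName.codePointCount(0, someFileName.length())); // 10
        System.out.println(normalized.codePointCount(0, normalized.length()));     // 9
        System.out.println(normalized.codePointAt(1));                             // 246

        // NFD reverses the step and yields the decomposed form again.
        String decomposedAgain = Normalizer.normalize(normalized, Normalizer.Form.NFD);
        System.out.println(decomposedAgain.equals(someFileName));                  // true
    }
}
```

Normalizing every file name once on input means later comparisons and regex matches all see the same representation.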

answered Sep 30 '22 04:09

peter_the_oak

Tags:

java

string

character-encoding

unicode-normalization

combining-marks