
What component handles a Combining Diaeresis in a string?

I am working with a list of file names in Java.

I observe that some single characters in the file names, like ä, ö and ü, actually consist of a sequence you could describe as two separate characters in a row:

ö is represented by o, ¨

I see this by inspection with codePointAt(). The German name "Rölli" is in fact "Ro¨lli":

...
20: R, 82
21: o, 111
22: ̈, 776
23: l, 108
24: l, 108
25: i, 105
...

The character ¨ in the log above has the value 776, which is a "Combining Diaeresis". This is a so-called combining mark, or more precisely a combining diacritic. So it all makes sense, but I do not understand which software component combines the two characters into one umlaut, and where this behavior is specified.

  • It has nothing to do with the fact that powerful character code tables use several bytes as internal representation. Several bytes are not the same as two combining characters.
  • Any simple print() of the string shows me the combined character, so it is not some UI layer above either.
  • I remember observing this with PHP as well. I guess any modern language can handle this.

What component causes combining characters to be displayed as single combined characters? How reliable is all this?

Does Java have a normalization method that turns combined code points into single code points, like here? That would help when using regex...

Thanks a lot for any hint.

asked Nov 04 '15 10:11 by peter_the_oak



1 Answer

Answer 1: Specification and responsibility

The behavior you describe is defined in Unicode Standard Annex #15, Unicode Normalization Forms. It covers the equivalence of combining character sequences and single code points, and the decomposition of code points. Many languages other than German rely heavily on composing graphemes.

Java internally represents strings as UTF-16. So all it does with its String class is deliver sequences of UTF-16 code units to other components. It is up to the surrounding software (e.g. any kind of text view component) to combine those sequences correctly when displaying them. You notice this when, for example, a regex breaks your combined ö apart, yet it is shown correctly in some view.
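To make this concrete, here is a minimal sketch of how the String class sees the two spellings of "Rölli". The class and constant names are only illustrative and do not come from the question.

```java
// Illustrative sketch: one visible name, two different char sequences.
class CombiningSequenceDemo {
    // "Rölli" spelled as o followed by U+0308 COMBINING DIAERESIS (decomposed)
    static final String DECOMPOSED = "Ro\u0308lli";
    // "Rölli" spelled with the single precomposed code point U+00F6
    static final String PRECOMPOSED = "R\u00F6lli";

    public static void main(String[] args) {
        // String counts UTF-16 code units, not what the reader perceives as letters
        System.out.println(DECOMPOSED.length());   // 6
        System.out.println(PRECOMPOSED.length());  // 5

        // No implicit normalization happens on comparison
        System.out.println(DECOMPOSED.equals(PRECOMPOSED)); // false

        // "." matches a single code point, so the combining mark breaks the pattern
        System.out.println(PRECOMPOSED.matches("R.lli")); // true
        System.out.println(DECOMPOSED.matches("R.lli"));  // false
    }
}
```

Both strings look identical once rendered, which is why the difference usually only surfaces in comparisons, length checks or regex matches.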

By the way, if you do some experiments with the Combining Diaeresis, be aware that there is also a "non-functional" code 168 (U+00A8), a Latin-1 character (not ASCII, which ends at 127) called "Diaeresis" or "Spacing Diaeresis". Code 168 does not cause any software to combine two code points into one. For this you need the Unicode code point 776 (U+0308).
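If you want to check which of the two you are dealing with, the following sketch uses the Unicode general category exposed by java.lang.Character. The class and method names are hypothetical. The helper only tests for non-spacing marks, which is an assumption that fits the diaeresis but does not cover every kind of combining character.

```java
// Illustrative sketch: telling the combining diaeresis from the spacing one.
class DiaeresisKinds {
    // True for non-spacing marks (general category Mn), such as U+0308
    static boolean isNonSpacingMark(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        System.out.println(Character.getName(0x0308)); // COMBINING DIAERESIS
        System.out.println(isNonSpacingMark(0x0308));  // true

        System.out.println(Character.getName(0x00A8)); // DIAERESIS
        System.out.println(isNonSpacingMark(0x00A8));  // false
    }
}
```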

Answer 2: Java's normalization method

Basically, you should always take combined chars into account, unless you are sure that your data source cannot deliver them. It's a good idea to sanitize your strings first.

Look for Unicode normalization methods in your language. They save you from fiddling with individual replace() statements, and they embody a lot of experience.

Java has a Normalizer class (java.text.Normalizer) that deals with different representations of combined characters:

https://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html

and the tutorial for it: https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html

So after invoking this code line:

String normalized = Normalizer.normalize(someFileName, Normalizer.Form.NFC);

the log print from the question above looks like this:

...
19:  , 32
20: R, 82
21: ö, 246   <<< here were two combined chars before normalize()
22: l, 108
23: l, 108
24: i, 105
...
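For reference, here is a small self-contained sketch of the normalization step together with a code point dump in the same style as the logs above. The class name and the helpers toNfc and dump are hypothetical, and the input string stands in for someFileName.

```java
import java.text.Normalizer;

// Illustrative sketch: NFC normalization followed by a code point dump.
class NormalizeDemo {
    // Compose base letter + combining mark into a single code point where one exists
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Print "index: character, decimal code point" for each code point
    static void dump(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println(i + ": " + new String(Character.toChars(cp)) + ", " + cp);
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
    }

    public static void main(String[] args) {
        String raw = "Ro\u0308lli"; // o followed by U+0308, as in the question
        String normalized = toNfc(raw);

        dump(normalized);
        System.out.println(Normalizer.isNormalized(raw, Normalizer.Form.NFC)); // false
        System.out.println(normalized.matches("R.lli"));                       // true
    }
}
```

Normalizing once, right where the file names enter the program, keeps later regex and equals() calls simple.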
answered Sep 30 '22 04:09 by peter_the_oak