I am working with a list of file names in Java.
I observe that some single characters in the file names, like ä, ö and ü, actually consist of a sequence you could describe as two separate characters following each other: ö is represented by o, ¨
I see this by inspecting the string with codePointAt(). The German name "Rölli" is in fact "Ro¨lli":
...
20: R, 82
21: o, 111
22: ̈, 776
23: l, 108
24: l, 108
25: i, 105
...
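A log like the one above can be produced with a loop along the following lines. This is only a sketch: the class name CodePointDump, the helper dump() and the sample string are made up for illustration, and the real file names of course contain more characters.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {
    // Builds one "index: character, code point" line per code point of the string.
    static List<String> dump(String s) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            lines.add(i + ": " + new String(Character.toChars(cp)) + ", " + cp);
            // Advance by 1 or 2 chars, depending on whether cp needs a surrogate pair.
            i += Character.charCount(cp);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "Rölli" written as o followed by U+0308 (decimal 776).
        dump("Ro\u0308lli").forEach(System.out::println);
    }
}
```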
The character ¨ in the log above has the value 776 (U+0308), which is a "Combining Diaeresis". This is a so-called combining mark, or more precisely a combining diacritic. So it all makes sense, but I do not understand what software component combines the two characters into one umlaut, and where this behavior is specified.
print() of the string shows me the combined character, so it is not some UI layer above either. What component causes combining characters to be displayed as single combined characters? How reliable is all this?
Does Java have a normalization method that turns combined code points into single code points, like here? That would help when using regex...
Thanks a lot for any hint.
A diaeresis is used when you have two vowels next to one another that should be pronounced as separate syllables instead of being jumbled together as a diphthong. The word “naïve” is a good example.
The sign is used especially when no vowel marks are present, which could differentiate between the two forms. Although the origin of the Siyame is different from that of the diaeresis sign, in modern computer systems both are represented by the same Unicode character. This, however, often leads to wrong rendering of the Syriac text.
The diaeresis was borrowed for this purpose in several languages of western and southern Europe, among them Occitan, Catalan, French, Dutch, Welsh, and (rarely) English. As a further extension, some languages began to use a diaeresis whenever a vowel letter was to be pronounced separately.
The Occitan use of diaeresis is very similar to that of Catalan: ai, ei, oi, au, eu, ou are diphthongs consisting of one syllable but aï, eï, oï, aü, eü, oü are groups consisting of two distinct syllables.
Answer 1: Specification and responsibility
The behavior you describe is defined in Unicode Standard Annex #15, Unicode Normalization Forms. It covers the equivalence of combined characters and single code points, and the decomposition of code points. Many languages other than German rely heavily on composing graphemes.
Java internally represents strings as UTF-16. So all it does with its String class is deliver sequences of UTF-16 code units to other components. It is up to the surrounding software (e.g. any kind of text view component) to combine the sequences correctly. You notice this in moments where e.g. a regex breaks your combined ö apart, yet it is shown correctly in some view.
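To make the regex problem concrete, here is a small sketch. The class name, the helper matchesName() and the pattern "R.lli" are only assumptions for illustration. The idea is that "." consumes exactly one code point, so the decomposed o + U+0308 no longer fits where a single ö is expected.

```java
import java.text.Normalizer;

class RegexOnDecomposed {
    // Hypothetical check: "R", any single code point, then "lli".
    static boolean matchesName(String name) {
        return name.matches("R.lli");
    }

    public static void main(String[] args) {
        String decomposed = "Ro\u0308lli"; // o followed by the combining diaeresis
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(decomposed.length());     // 6 chars
        System.out.println(composed.length());       // 5 chars
        System.out.println(matchesName(decomposed)); // false: "." only eats the o
        System.out.println(matchesName(composed));   // true: ö is one code point
    }
}
```

Both strings look identical when printed, which is exactly why this kind of mismatch is easy to overlook.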
By the way, if you experiment with the Combining Diaeresis, be aware that there is also a "non-functional" code 168 (U+00A8), called "Spacing Diaeresis". It is not an ASCII character (ASCII ends at 127) but belongs to the Latin-1 range. Code 168 does not cause any software to combine two code points into one. For this you need code 776 (U+0308).
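If you want to see the difference between the two diaeresis codes for yourself, something like the following should do. It is a sketch with made-up names (DiaeresisKinds, nfc()), not anything from the Java API beyond Normalizer and Character.

```java
import java.text.Normalizer;

class DiaeresisKinds {
    // Convenience wrapper around NFC normalization.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String withCombining = "o\u0308"; // o + code 776, combining diaeresis
        String withSpacing = "o\u00A8";   // o + code 168, spacing diaeresis

        // The combining mark is composed with the o into a single ö (code 246).
        System.out.println(nfc(withCombining).codePointAt(0)); // 246
        System.out.println(nfc(withCombining).length());       // 1

        // The spacing diaeresis stays a separate character under NFC.
        System.out.println(nfc(withSpacing).length());         // 2

        // Only code 776 is classified as a non-spacing (combining) mark.
        System.out.println(Character.getType(0x0308) == Character.NON_SPACING_MARK); // true
        System.out.println(Character.getType(0x00A8) == Character.NON_SPACING_MARK); // false
    }
}
```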
Answer 2: Java's normalization method
Basically, you should always take combined characters into account, unless you are sure that your data source cannot deliver them. It's a good idea to sanitize your strings first.
Look for Unicode normalization methods in your language: they save you from fiddling with individual replace() statements, and a lot of experience has gone into them.
Java has a Normalizer class that deals with different representations of combined characters:
https://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html
and the tutorial for it: https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
So after invoking this line of code:
String normalized = Normalizer.normalize(someFileName, Normalizer.Form.NFC);
the log print from the question above looks like this:
...
19: , 32
20: R, 82
21: ö, 246 <<< here were two combined chars before normalize()
22: l, 108
23: l, 108
24: i, 105
...
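Put together, a minimal self-contained version could look like this. The class name FileNameNormalizer, the helper toNfc() and the sample file name are assumptions for the sake of the example. Normalizer.isNormalized() is used only to skip the work when the string is already in NFC.

```java
import java.text.Normalizer;

class FileNameNormalizer {
    // Returns the NFC form of the name; already normalized input is returned as is.
    static String toNfc(String name) {
        if (Normalizer.isNormalized(name, Normalizer.Form.NFC)) {
            return name;
        }
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Hypothetical file name containing o + U+0308 instead of a single ö.
        String someFileName = "Ro\u0308lli.txt";
        String normalized = toNfc(someFileName);

        System.out.println(someFileName.equals("R\u00F6lli.txt")); // false
        System.out.println(normalized.equals("R\u00F6lli.txt"));   // true
        System.out.println(normalized.codePointAt(1));             // 246
    }
}
```

Normalizing both sides before any comparison or regex match keeps the two spellings of ö from being treated as different names.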