TomC recommends decomposing Unicode characters on the way in, and recomposing on the way out (http://www.perl.com/pub/2012/04/perl-unicode-cookbook-always-decompose-and-recompose.html).
The former makes perfect sense to me, but I can't see why he recommends recomposing on the way out. Potentially you could save a small amount of space if your text is heavy with European accented characters, but you're just pushing that on to someone else's decomposition function.
Are there any other obvious reasons I'm missing?
Essentially, the Unicode Normalization Algorithm puts all combining marks into a specified order and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings then determines equivalence.
In other words, the standard defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two equivalent texts are reduced to the same sequence of code points, called the normalization form (or normal form) of the original text.
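The equivalence described above can be demonstrated with Python's standard `unicodedata` module (the rest of this page uses Perl, but the behavior is the same in any language with a normalization API):

```python
import unicodedata

# Two canonically equivalent spellings of "é":
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)                    # False: different code points
print(unicodedata.normalize("NFC", decomposed))  # recomposed to the single code point
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True: equal after normalization
```

After both strings are transformed to the same normalization form, a plain binary comparison suffices.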
As Ven'Tatsu writes in a comment, there is software that can handle composed characters but not decomposed characters. Though the opposite is theoretically possible too, I have never seen it in practice and expect it to be rare.
To just display a decomposed character, the rendering software needs to deal with combining diacritic marks. It does not suffice to find them in the font. The renderer needs to position the diacritic properly, using information about the dimensions of the base character. There are often problems with this, resulting in poor rendering—especially if the rendering uses the diacritic from a different font! The result can hardly be better than what is achieved by simply displaying the glyph of a precomposed character like “é”, designed by a typographer.
(Rendering software can also analyze the situation and effectively map the decomposed character to a precomposed character. But that would require extra code.)
It's quite simple: Most tools have limited Unicode support; they assume characters are in the NFC form.
For example, this is commonly how people compare strings:
perl -CSDA -e 'use utf8; if ($ARGV[0] eq "Éric") { ... }'
And of course, the "É" is in NFC form (since that's what almost everything produces), so this program only accepts arguments in NFC form.
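The same fragility, and the obvious fix of normalizing both sides before comparing, can be sketched in Python with the standard `unicodedata` module (`equals_nfc` is a hypothetical helper, not part of the one-liner above):

```python
import unicodedata

def equals_nfc(a: str, b: str) -> bool:
    # Normalize both sides so either spelling of the accent matches.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

nfc_arg = "\u00c9ric"    # "Éric" with precomposed É (U+00C9)
nfd_arg = "E\u0301ric"   # "Éric" with E + combining acute accent

print(nfc_arg == nfd_arg)            # False: the naive comparison rejects NFD input
print(equals_nfc(nfc_arg, nfd_arg))  # True: normalization makes them compare equal
```

A program that only ever does the naive comparison silently works for NFC input and silently fails for NFD input, which is exactly why recomposing on the way out is the friendly default.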