Why should you recompose Unicode (NFC) on the way out?

Tags: unicode, perl

TomC recommends decomposing Unicode characters on the way in, and recomposing on the way out (http://www.perl.com/pub/2012/04/perl-unicode-cookbook-always-decompose-and-recompose.html).

The former makes perfect sense to me, but I can't see why he recommends recomposing on the way out. Potentially you could save a small amount of space if your text is heavy with European accented characters, but you're just pushing that on to someone else's decomposition function.

Are there any other obvious reasons I'm missing?

petersergeant asked Apr 04 '12 13:04



2 Answers

As Ven'Tatsu writes in a comment, there is software that can handle composed characters but not decomposed characters. Though the opposite is theoretically possible too, I have never seen it in practice and expect it to be rare.

To just display a decomposed character, the rendering software needs to deal with combining diacritic marks. It does not suffice to find them in the font. The renderer needs to position the diacritic properly, using information about the dimensions of the base character. There are often problems with this, resulting in poor rendering—especially if the rendering uses the diacritic from a different font! The result can hardly be better than what is achieved by simply displaying the glyph of a precomposed character like “é”, designed by a typographer.

(Rendering software can also analyze the situation and effectively map the decomposed character to a precomposed character. But that would require extra code.)
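As a rough illustration of the difference the renderer sees, the sketch below uses Perl's core Unicode::Normalize module to show that a decomposed "é" is two code points (base letter plus combining mark), while its NFC form is the single precomposed code point. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# "é" written as base letter "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $decomposed = "e\x{301}";

# NFC recomposes the pair into the precomposed code point U+00E9.
my $composed = NFC($decomposed);

printf "decomposed: %d code point(s)\n", length $decomposed;   # 2
printf "composed:   %d code point(s)\n", length $composed;     # 1
printf "composed is U+%04X\n", ord $composed;                  # U+00E9
```

A consumer that only knows how to look up one glyph per code point can display the second string but not the first.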

Jukka K. Korpela answered Nov 15 '22 04:11


It's quite simple: Most tools have limited Unicode support; they assume characters are in the NFC form.

For example, this is commonly how people compare strings:

perl -CSDA -e'use utf8; if ($ARGV[0] eq "Éric") { ... }'

And of course, the "É" is in NFC form (since that's what almost everything produces), so this program only accepts arguments in NFC form.
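To make the failure mode concrete, here is a minimal sketch, assuming the input arrives decomposed (simulated here with NFD). The helper `same_text` is hypothetical, not something from the answer; it shows one way a program could normalize both sides before comparing instead of relying on the caller to send NFC.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: compare two strings after normalizing both to NFC.
sub same_text {
    my ($left, $right) = @_;
    return NFC($left) eq NFC($right);
}

# "Éric" with a precomposed É (U+00C9), as most editors would produce it.
my $literal = "\x{C9}ric";

# Simulate the same name arriving in decomposed form: "E" + U+0301 + "ric".
my $input = NFD($literal);

print "plain eq:  ", ($input eq $literal          ? "match" : "no match"), "\n";
print "same_text: ", (same_text($input, $literal) ? "match" : "no match"), "\n";
```

The plain `eq` reports no match because the two strings differ code point by code point, even though they are canonically equivalent. Emitting NFC on the way out spares downstream programs that only do the plain comparison.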

ikegami answered Nov 15 '22 06:11