
Which is the better Unicode Normalization Form?

I have four options on Dreamweaver: C, D, KC, KD. Which one should I choose and why?

Miki asked Mar 22 '11 10:03




1 Answer

It depends on what for. For saving a file, use NFC, since that is what the W3C character model for the web uses (strictly, W3C normalisation requires both that the stream be in NFC and that the text is still in NFC after entities in HTML or XML are converted to the characters they represent). The odds that it will ever make a practical difference are slim, though it could stop a few rather obscure issues from upsetting someone down the line.

Normalisation makes certain equivalent sequences result in identical streams. For example, U+0065 (e) followed by U+0301 (a combining acute accent) is equivalent to U+00E9 (é) on its own.
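As a quick illustration of that equivalence, here is a minimal sketch using Python's standard `unicodedata` module (the variable names are only for illustration). The two spellings of é compare unequal as raw strings but become identical once both are put into the same form:

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 followed by U+0301 (combining acute accent)
precomposed = "\u00e9"   # U+00E9 on its own

# The raw strings differ code point by code point...
raw_equal = decomposed == precomposed

# ...but they agree once both are normalised to the same form.
nfc_equal = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", precomposed))
nfd_equal = (unicodedata.normalize("NFD", decomposed)
             == unicodedata.normalize("NFD", precomposed))

print(raw_equal, nfc_equal, nfd_equal)  # False True True
```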

NFD splits all such strings up into their component parts (e.g. turning U+00E9 into U+0065 followed by U+0301). If there are two or more combining characters in a row, they are re-ordered according to rules that guarantee consistency (ḉ could have the cedilla followed by the acute or the acute followed by the cedilla, and we need a consistent ordering so that the same string is produced). Mostly NFD is useful for internal processing as part of another task, such as stripping accents, or producing NFC.
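The decomposition, the re-ordering, and the accent-stripping use case can be sketched in Python with `unicodedata`. The helpers `code_points` and `strip_accents` are hypothetical names for this example, and dropping every combining mark is a deliberately crude approach that will not suit every language:

```python
import unicodedata

def code_points(text):
    """Render a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]

# NFD splits the precomposed é into its base letter and combining mark.
nfd_e = unicodedata.normalize("NFD", "\u00e9")
print(code_points(nfd_e))  # ['U+0065', 'U+0301']

# Two orderings of the same marks on "c": acute first, or cedilla first.
acute_first = "c\u0301\u0327"
cedilla_first = "c\u0327\u0301"

# NFD re-orders the marks, so both inputs yield the same sequence.
same_after_nfd = (unicodedata.normalize("NFD", acute_first)
                  == unicodedata.normalize("NFD", cedilla_first))
print(same_after_nfd)  # True

def strip_accents(text):
    """Decompose with NFD, then drop every combining character."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("caf\u00e9"))  # cafe
```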

NFC starts with NFD and then combines the characters together again where possible, barring a few exceptions to ensure that what was a normalised string with one version of Unicode remains so with another.
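A small sketch of that recomposition in Python follows. The excluded character chosen here, U+0958 (a Devanagari letter), is my own pick to illustrate the exceptions and is not one the answer above singles out:

```python
import unicodedata

# Base letter plus combining acute is recombined into the single code point.
composed = unicodedata.normalize("NFC", "e\u0301")
print(composed == "\u00e9")  # True

# Marks given in either order still end up as the one precomposed character.
both_marks = unicodedata.normalize("NFC", "c\u0301\u0327")
print(both_marks == "\u1e09")  # True

# One of the exceptions: U+0958 is excluded from composition, so NFC
# leaves it in decomposed form rather than recombining it.
excluded = unicodedata.normalize("NFC", "\u0958")
print(excluded == "\u0915\u093c")  # True
```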

NFKD goes further than NFD in replacing certain similar characters with each other. ⁵, for example, is replaced with 5. This "damages" the text (a user may reasonably choose ⁵ over 5 for a good reason) but is useful for searching (search for "fiſh" on Google and it returns results for "fish" because it treats the long s the same as a short s) and as a restriction in certain cases to avoid security issues with similar but different characters. NFKC first does NFKD and then combines in the same manner as NFC.
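The compatibility forms can be tried the same way. In the sketch below, `search_key` is a hypothetical helper showing one way to build a comparison key for searching. Adding `casefold()` is my assumption, not part of normalisation itself, and nothing here claims to be how any search engine actually works:

```python
import unicodedata

superscript = "\u2075"     # ⁵
long_s_word = "fi\u017fh"  # "fiſh", spelled with U+017F (long s)

# The canonical forms leave compatibility characters alone...
print(unicodedata.normalize("NFC", superscript) == superscript)  # True

# ...while the compatibility forms replace them with plain equivalents.
print(unicodedata.normalize("NFKD", superscript))  # 5
print(unicodedata.normalize("NFKC", long_s_word))  # fish

# NFKC recombines after decomposing, in the same way NFC does.
print(unicodedata.normalize("NFKC", "e\u0301") == "\u00e9")  # True

def search_key(text):
    """Fold compatibility variants and letter case into one comparison key."""
    return unicodedata.normalize("NFKC", text).casefold()

print(search_key("Fi\u017fh") == search_key("fish"))  # True
```

Only the key should be stored or compared this way; the original text is best kept untouched, since the replacement loses information.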

http://unicode.org/reports/tr15/ for the full skinny, and "use NFC but don't worry about it" to repeat the short answer.

Jon Hanna answered Oct 10 '22 21:10