
Who performs Unicode normalization and when?

According to JavaScript: The Definitive Guide,

JavaScript assumes that the source code it is interpreting has already been normalized and makes no attempt to normalize identifiers, strings, or regular expressions itself.

The Unicode standard defines the preferred encoding for all characters and specifies a normalization procedure to convert text to a canonical form suitable for comparisons.

If JS does not normalize Unicode then who does it and when?

If JavaScript does not normalize Unicode, then how is

"café" === "caf\u00e9"   // => true

and why is

"café" === "cafe\u0301"   // => false

After all, both \u00e9 and e\u0301 are valid Unicode ways to form é.

Harshit Juneja asked Jul 23 '17 18:07



1 Answer

You are confusing Unicode normalization with string escaping.

"café"

…is the string made of characters with code points 0x63, 0x61, 0x66, 0xe9.

You can get the exact same string by using the escaped representation

"caf\u00e9"
// or even
"\u0063\u0061\u0066\u00e9"
// or why not
"\u0063\u0061fé"

When reading such a string literal, JavaScript un-escapes it. That is, it replaces each escape sequence with the matching character. It is the exact same process that replaces "\n" with a newline.
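As a minimal sketch of that un-escaping step, the snippet below builds the same four-character string in three ways and compares the results. The helper name codePoints is only illustrative and is not part of the original answer; the literals use escapes throughout so that the example does not depend on how your editor encodes é.

```javascript
// Three spellings of the same string: c, a, f, U+00E9.
const escaped = "caf\u00e9";
const fullyEscaped = "\u0063\u0061\u0066\u00e9";
const fromCodePoints = String.fromCodePoint(0x63, 0x61, 0x66, 0xe9);

// Illustrative helper: list the code points of a string in hex.
function codePoints(s) {
  return Array.from(s, (ch) => ch.codePointAt(0).toString(16));
}

console.log(escaped === fullyEscaped);   // true
console.log(escaped === fromCodePoints); // true
console.log(codePoints(escaped));        // [ '63', '61', '66', 'e9' ]

// "\n" is resolved the same way: the escape becomes code point 0x0A.
console.log("a\nb" === "a" + String.fromCharCode(10) + "b"); // true
```

None of this is normalization: the escapes are simply resolved to the code points they name when the literal is parsed.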

Your second example is actually a different string, because it uses the decomposed form of é: it is made of the characters 0x63, 0x61, 0x66, 0x65, 0x301. As no normalization happens, it is not the same string.

Now try with the same string, using that sequence, which you cannot type with your keyboard, but that I copy-paste here for you: "café". Test it now:

> a = "café"     // this one is copy-pasted with the combining acute
> b = "café"     // this one is typed using the "é" key on my keyboard
> a === "cafe\u0301"
<- true
> b === "cafe\u0301"
<- false
> a === "caf\u00e9"
<- false
> b === "caf\u00e9"
<- true
> a === b
<- false
// Now just making sure...
> a.length
<- 5
> b.length
<- 4

The fact that "café" and "café" are rendered the same does not make them the same string. JavaScript compares the strings, finds that 0x63, 0x61, 0x66, 0xe9 is not the same as 0x63, 0x61, 0x66, 0x65, 0x301 and returns false.
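If you do want the two forms to compare equal, the normalization has to be done explicitly by your own code, for example with the standard String.prototype.normalize method (ES2015). The following is a sketch under that assumption; the helper name sameText is hypothetical, and NFC is chosen here only as one reasonable form to compare in.

```javascript
const composed = "caf\u00e9";    // é as a single code point (U+00E9)
const decomposed = "cafe\u0301"; // e followed by combining acute (U+0301)

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(x, y) {
  return x.normalize("NFC") === y.normalize("NFC");
}

console.log(composed === decomposed);            // false
console.log(sameText(composed, decomposed));     // true
console.log(decomposed.normalize("NFC").length); // 4
console.log(composed.normalize("NFD").length);   // 5
```

The plain === comparison stays false because the engine compares the strings as stored; only the explicit normalize calls bring both strings to the same form.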

spectras answered Oct 04 '22 09:10