According to JavaScript: The Definitive Guide,
JavaScript assumes that the source code it is interpreting has already been normalized and makes no attempt to normalize identifiers, strings, or regular expressions itself.
The Unicode standard defines the preferred encoding for all characters and specifies a normalization procedure to convert text to a canonical form suitable for comparisons.
If JS does not normalize Unicode, then who does, and when?
If JavaScript does not normalize Unicode, then how is
"café" === "caf\u00e9" // => true
and why is
"café" === "cafe\u0301" // => false
given that both \u00e9 and e\u0301 are Unicode ways to form é?
Your applications can perform Unicode normalization using several algorithms, called "normalization forms," that obey different rules. The Unicode Consortium has defined four normalization forms: NFC (form C), NFD (form D), NFKC (form KC), and NFKD (form KD). Each form eliminates some differences but preserves case.
For most purposes on Windows, form C is the preferred form. For example, characters in form C are produced by Windows keyboard input. However, characters imported from the Web and other platforms can introduce other normalization forms into the data stream.
For example, the text string "a&#xnnnn;" (where nnnn = "0301") is Unicode-normalized since it consists only of ASCII characters, but it is not W3C-normalized, since it contains a representation of a combining acute accent with "a", and in normalization form C, that should have been normalized to U+00E1. [JC]
For loose matching, programs may want to use the normalization forms NFKC and NFKD, which remove compatibility distinctions. These two latter normalization forms, however, do lose information and are thus most appropriate for a restricted domain such as identifiers. For more information, see UAX #15, Unicode Normalization Forms.
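In JavaScript, one way an application can apply these forms itself is String.prototype.normalize (available since ES2015). The following is a minimal sketch, not a definitive recipe; the variable names are only illustrative, and it assumes an engine with ES2015 support such as a current browser or Node.js.

```javascript
// Two spellings of "café": precomposed é (U+00E9) versus "e" + combining acute (U+0301).
const composed = "caf\u00e9";
const decomposed = "cafe\u0301";

// NFD splits é into two code points; NFC merges them back into one.
console.log(composed.normalize("NFD").length);   // 5
console.log(decomposed.normalize("NFC").length); // 4

// The compatibility forms additionally fold characters such as the "fi" ligature (U+FB01).
const ligature = "\ufb01";
console.log(ligature.normalize("NFC") === "fi");  // false
console.log(ligature.normalize("NFKC") === "fi"); // true
```

Which form to pick depends on the data: NFC and NFD keep all information, while NFKC and NFKD are lossy, as noted above.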
You are confusing Unicode normalization with string escaping.
"café"
…is the string made of characters with code points 0x63, 0x61, 0x66, 0xe9.
You can get the exact same string by using the escaped representation
"caf\u00e9"
// or even
"\u0063\u0061\u0066\u00e9"
// or why not
"\u0063\u0061fé"
When reading such a string, JavaScript un-escapes it. That is, it replaces each escape sequence with the matching character. It is the exact same process that replaces "\n" with a newline.
Now, your second example is actually a different string, since it spells é in decomposed form rather than as the single precomposed character. It is a string made of characters 0x63, 0x61, 0x66, 0x65, 0x301. As no normalization happens, it is not the same string.
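To see the difference for yourself, you can list the code points of each string. This is just a small sketch; codePoints is a hypothetical helper name, not a built-in.

```javascript
// Hypothetical helper: return the code points of a string as hex strings.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePoints(s) {
  return Array.from(s, ch => "0x" + ch.codePointAt(0).toString(16));
}

console.log(codePoints("caf\u00e9").join(", "));  // 0x63, 0x61, 0x66, 0xe9
console.log(codePoints("cafe\u0301").join(", ")); // 0x63, 0x61, 0x66, 0x65, 0x301
```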
Now try with the same string, using that sequence, which you cannot type with your keyboard, but which I copy-paste here for you: "café" (that is, e followed by U+0301). Test it now:
> a = "café" // this one is copy-pasted with the combining acute
> b = "café" // this one is typed using the "é" key on my keyboard
> a === "cafe\u0301"
<- true
> b === "cafe\u0301"
<- false
> a === "caf\u00e9"
<- false
> b === "caf\u00e9"
<- true
> a === b
<- false
// Now just making sure...
> a.length
<- 5
> b.length
<- 4
The fact that "café" and "café" are rendered the same does not make them the same string. JavaScript compares the strings, finds that 0x63, 0x61, 0x66, 0xe9 is not the same as 0x63, 0x61, 0x66, 0x65, 0x301, and returns false.
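If you do want the two spellings to compare as equal, you have to normalize both sides yourself before comparing. A minimal sketch, assuming NFC is an acceptable target form for your data; sameText is a hypothetical helper name:

```javascript
// Hypothetical helper: compare two strings after bringing both to NFC.
function sameText(x, y) {
  return x.normalize("NFC") === y.normalize("NFC");
}

const a = "cafe\u0301"; // decomposed: e + combining acute
const b = "caf\u00e9";  // precomposed é

console.log(a === b);        // false
console.log(sameText(a, b)); // true
```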