Quoted from here:
Security may also be impacted by a characteristic of several character encodings, including UTF-8: the "same thing" (as far as a user can tell) can be represented by several distinct character sequences. For instance, an e with acute accent can be represented by the precomposed U+00E9 E ACUTE character or by the canonically equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though UTF-8 provides a single byte sequence for each character sequence, the existence of multiple character sequences for "the same thing" may have security consequences whenever string matching, indexing,
Is this a hidden feature of UTF-8 that I've never tackled before?
This issue is not actually specific to UTF-8 at all. It happens with all encodings that can represent all (or at least most) Unicode codepoints.
The general idea of Unicode is to avoid so-called precomposed characters (e.g. U+00E9 LATIN SMALL LETTER E WITH ACUTE) and instead provide the base character (e.g. U+0065 LATIN SMALL LETTER E) together with a combining character (e.g. U+0301 COMBINING ACUTE ACCENT). This has the advantage of not having to encode every possible combination as its own character.
Note: the U+xxxx notation is used to refer to Unicode code points. It's the encoding-independent way to refer to Unicode characters.
However, when Unicode was first designed, an important goal was round-trip compatibility with existing, widely used encodings, so some precomposed characters were included (in fact, most of the characters with diacritics from the Latin and related alphabets are included).
So yes (and tl;dr): in a correctly working Unicode-capable application U+00E9 should render the same way and be treated the same way as U+0065 followed by U+0301.
There's a non-trivial process called normalization that helps work with these differences by reducing a given string to one of four normal forms.
For example, normalizing either string (U+00E9 or U+0065 U+0301) yields U+00E9 under NFC and U+0065 U+0301 under NFD.
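The NFC/NFD behaviour described above can be checked with a minimal sketch in Python, assuming the standard library's `unicodedata` module is acceptable for illustration (the variable names are my own, not from the original answer):

```python
import unicodedata

# Two spellings of "the same thing": precomposed vs. base + combining mark.
precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # U+0065 U+0301

# Without normalization, a plain comparison treats them as different strings.
raw_equal = precomposed == decomposed

# NFC composes where possible; NFD decomposes where possible.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
nfd_a = unicodedata.normalize("NFD", precomposed)
nfd_b = unicodedata.normalize("NFD", decomposed)

print(raw_equal)                                  # False
print(nfc_a == nfc_b, [hex(ord(c)) for c in nfc_a])
print(nfd_a == nfd_b, [hex(ord(c)) for c in nfd_a])
```

Normalizing both sides to the same form before comparing is what makes string matching behave the way a user would expect here.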
Very short and visualized example: the character "é" can either be represented using the Unicode code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE, é), or the sequence U+0065 (LATIN SMALL LETTER E, e) followed by U+0301 (COMBINING ACUTE ACCENT, ´), which together look like this: é.
In UTF-8, the precomposed é has the byte sequence 0xC3 0xA9, while the decomposed é has the byte sequence 0x65 0xCC 0x81.
Note: Due to technical limitations this post does not contain the actual combining characters.
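The byte sequences mentioned above can be reproduced with a short sketch in Python; the helper name `utf8_hex` is hypothetical and only there for illustration:

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))

precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # U+0065 U+0301

print(utf8_hex(precomposed))  # C3 A9
print(utf8_hex(decomposed))   # 65 CC 81
```

Each code point sequence has exactly one UTF-8 encoding; the two byte sequences differ only because the underlying code point sequences differ.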