Can anyone explain the "same thing" issue of UTF-8?

Question

Quoted from here:

Security may also be impacted by a characteristic of several character encodings, including UTF-8: the "same thing" (as far as a user can tell) can be represented by several distinct character sequences. For instance, an e with acute accent can be represented by the precomposed U+00E9 E ACUTE character or by the canonically equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though UTF-8 provides a single byte sequence for each character sequence, the existence of multiple character sequences for "the same thing" may have security consequences whenever string matching, indexing,

Is this a hidden feature of UTF-8 that I've never tackled before?

Joachim Sauer · Accepted Answer

This issue is not actually specific to UTF-8 at all. It happens with every encoding that can represent all (or at least most) Unicode code points.
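A minimal Python sketch of that point, assuming Python 3's built-in codecs (the variable names are only illustrative): the two spellings of "é" produce different bytes under UTF-8, UTF-16 and UTF-32 alike, because they differ at the code point level, before any encoding is applied.

```python
# Two spellings of the same user-visible character, written with escapes
# so the source file itself cannot be silently normalized.
precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # U+0065 followed by U+0301

# Encode both spellings with several Unicode encodings and compare the bytes.
results = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    a = precomposed.encode(encoding)
    b = decomposed.encode(encoding)
    results[encoding] = (a, b)
    print(f"{encoding:10} {a.hex():10} {b.hex():18} equal={a == b}")
```

Every row reports `equal=False`, so the difference is a property of the code point sequences rather than of UTF-8.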

The general idea of Unicode is to avoid providing so-called pre-composed characters (e.g. U+00E9 LATIN SMALL LETTER E WITH ACUTE). Instead, it usually provides the base character (e.g. U+0065 LATIN SMALL LETTER E) and the combining character (e.g. U+0301 COMBINING ACUTE ACCENT) separately. This has the advantage of not having to encode every possible combination as its own character.

Note: the U+xxxx notation is used to refer to Unicode code points. It's the encoding-independent way to refer to Unicode characters.

However, when Unicode was first designed, an important goal was round-trip compatibility with existing, widely used encodings, so some pre-composed characters were included (in fact, most of the accented letters from the Latin and related alphabets are included).

So yes (and tl;dr): in a correctly working Unicode-capable application, U+00E9 should render the same way and be treated the same way as U+0065 followed by U+0301.
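"Treated the same way" does not happen automatically, though. The following Python sketch shows what a plain comparison does; the `allowed_users` set is a made-up stand-in for the kind of identifier matching the quoted passage warns about.

```python
# The same visible word, spelled with and without a pre-composed character.
precomposed = "caf\u00e9"    # ends in U+00E9
decomposed = "cafe\u0301"    # ends in U+0065 U+0301

# A plain comparison works on code points, so the strings are different.
print(precomposed == decomposed)           # False
print(len(precomposed), len(decomposed))   # 4 5

# Hypothetical access list: a lookup with the other spelling misses.
allowed_users = {precomposed}
print(decomposed in allowed_users)         # False
```

An application that wants both spellings to match has to do extra work before comparing.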

There's a non-trivial process called normalization that helps work with these differences by reducing a given string to one of four normal forms (NFC, NFD, NFKC and NFKD).

For example, normalizing either string (U+00E9 or U+0065 U+0301) yields U+00E9 under NFC and U+0065 U+0301 under NFD.
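As a rough illustration, here is how that looks with `unicodedata.normalize` from Python's standard library. The helpers `code_points` and `same_text` are names invented for this sketch, and choosing NFC for the comparison is an assumption; any one form works as long as both sides use it.

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # U+0065 U+0301


def code_points(text: str) -> str:
    """Format a string as U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


for form in ("NFC", "NFD"):
    for text in (precomposed, decomposed):
        normalized = unicodedata.normalize(form, text)
        print(form, code_points(text), "->", code_points(normalized))

print(same_text(precomposed, decomposed))   # True
```

Both inputs come out as U+00E9 under NFC and as U+0065 U+0301 under NFD, so comparing normalized strings treats the two spellings as equal.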

deceze · Answer

A very short, visual example: the character "é" can be represented either by the single Unicode code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE, é), or by the sequence U+0065 (LATIN SMALL LETTER E, e) followed by U+0301 (COMBINING ACUTE ACCENT, ´), which together look like this: é.

In UTF-8, the pre-composed é (U+00E9) has the byte sequence 0xC3 0xA9, while the decomposed é (U+0065 U+0301) has the byte sequence 0x65 0xCC 0x81.
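These byte sequences can be checked with a few lines of Python; this is only a sketch, and the `utf8_hex` helper is a name made up for it.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex (illustrative helper)."""
    return " ".join(f"0x{byte:02X}" for byte in text.encode("utf-8"))


print(utf8_hex("\u00e9"))    # 0xC3 0xA9
print(utf8_hex("e\u0301"))   # 0x65 0xCC 0x81
```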

Note: Due to technical limitations, this post does not contain the actual combining characters.

Tags:

unicode

utf-8

normalization

combining-marks

new_perl
