
Is comparing two byte[] of utf-8 encoded strings the same as comparing two unicode strings?

Tags:

c#

unicode

I found this in the Wikipedia article on UTF-8:

Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points.

That would lead me to believe that, for comparison purposes (sorting, binary search, etc.), comparing two byte arrays of UTF-8 encoded strings (i.e. byte-by-byte, like memcmp) would give the same results as comparing the actual Unicode strings.

Is this true?

Eloff asked Aug 13 '10 16:08



2 Answers

All of the other answers discuss either proper/complicated Unicode comparison, or code point comparison.

However, there is another type of comparison you may care about, which is code unit comparison. This is the type used a lot in web platform specifications, for example. And I would expect it to show up in other "WTF-16" contexts like Win32 APIs, Java, and C#.

Code unit comparison is not equivalent to bytewise UTF-8 comparison, because of unpaired surrogate code units. A proper Unicode string (i.e. sequence of code points) cannot contain unpaired surrogates; all surrogate code units are part of a pair, which together make up a single code point. But many languages like JavaScript, Java, and C# will allow such unpaired surrogates. We call the strings in those languages "WTF-16 strings".
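As a small illustration of the point about unpaired surrogates, here is a sketch in Python (chosen only because its `str` type, like the languages above, tolerates a lone surrogate). It assumes CPython's standard codecs: strict UTF-8 refuses to encode the lone surrogate, and only the non-standard `surrogatepass` error handler produces bytes for it.

```python
# An unpaired high surrogate. Python's str permits it, much like a
# "WTF-16" string in JavaScript, Java, or C#.
lone = "\ud800"

# Strict UTF-8 has no encoding for a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(strict_ok)  # False

# The "surrogatepass" error handler emits the generalized three-byte form.
# This is not well-formed UTF-8; it is shown only for illustration.
relaxed = lone.encode("utf-8", "surrogatepass")
print(relaxed.hex())  # eda080
```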

For strings containing surrogate code units, whether paired or unpaired, UTF-8 byte-wise comparison will not always sort the same as code unit comparison. For example, in code unit order:

U+FF61 should sort after U+10002

since these encode in WTF-16 to the code units

0xFF61 > 0xD800 0xDC02

but the UTF-8 byte order comparison matches the code point order:

0xEF 0xBD 0xA1 < 0xF0 0x90 0x80 0x82
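The comparison above can be checked with a short sketch. It is written in Python rather than C#, and the helper `utf16_code_units` is a hypothetical name introduced here; it simply splits the big-endian UTF-16 encoding into 16-bit integers so that the two orderings can be compared side by side.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

a = "\uFF61"       # a BMP code point above the surrogate range
b = "\U00010002"   # a supplementary code point (needs a surrogate pair)

units_a = utf16_code_units(a)
units_b = utf16_code_units(b)
print([hex(u) for u in units_a])  # ['0xff61']
print([hex(u) for u in units_b])  # ['0xd800', '0xdc02']

# Code unit order: a sorts AFTER b, because 0xFF61 > 0xD800.
code_unit_a_after_b = units_a > units_b

# UTF-8 byte order and code point order: a sorts BEFORE b.
utf8_a_before_b = a.encode("utf-8") < b.encode("utf-8")
code_point_a_before_b = a < b  # Python compares str by code point

print(code_unit_a_after_b, utf8_a_before_b, code_point_a_before_b)  # True True True
```

The same disagreement appears for any code point in U+E000..U+FFFF compared against any supplementary code point.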

So, to conclude: if for some reason, such as matching web standards, you need code unit ordering instead of code point ordering, you cannot simply compare the UTF-8 bytes. This page from the ICU project has some more background.

Domenic answered Oct 23 '22 05:10



Yes, given that there's a one-to-one, order-preserving mapping between sequences of bytes in the UTF-8 encoding and Unicode code points.
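One way to convince yourself of this is an empirical check. The following is a minimal sketch in Python, not a proof; the helper names and the fixed random seed are assumptions made for the illustration. It relies on Python comparing `str` values by code point and `bytes` values as unsigned bytes.

```python
import random

def random_code_point(rng):
    """Pick a random Unicode scalar value (any code point except surrogates)."""
    while True:
        cp = rng.randrange(0x110000)
        if not 0xD800 <= cp <= 0xDFFF:
            return cp

def random_string(rng, max_len=6):
    """Build a random well-formed string of 0..max_len code points."""
    length = rng.randrange(max_len + 1)
    return "".join(chr(random_code_point(rng)) for _ in range(length))

rng = random.Random(0)  # fixed seed so the run is repeatable
strings = [random_string(rng) for _ in range(2000)]

by_code_point = sorted(strings)                                  # code point order
by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))  # memcmp-style order

orders_match = by_code_point == by_utf8_bytes
print(orders_match)  # True
```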

However, there are ways to compare Unicode strings besides looking at the raw code points. If you just look at code points -- or UTF-8 bytes -- as numbers then you miss culture-specific comparison logic.
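To make that concrete, here is a small sketch (again in Python, with a made-up word list) of what plain numeric ordering does. It does not implement any culture-aware collation; it only shows that code point order and UTF-8 byte order agree with each other while differing from what a dictionary would typically use.

```python
# Sample words: "\u00e9" is the precomposed letter e with acute accent.
words = ["apple", "Zebra", "\u00e9clair", "zoo"]

by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# 'Z' (0x5A) sorts before 'a' (0x61), and the accented e (0xE9) sorts after 'z' (0x7A).
print(by_code_point)                   # ['Zebra', 'apple', 'zoo', 'éclair']
print(by_code_point == by_utf8_bytes)  # True
```

A culture-aware comparer would usually be expected to ignore case at the first level and place the accented word next to the other words starting with "e", which neither numeric ordering does.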

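To implement comparison and sorting correctly for a specific culture on .NET, you should use the standard culture-aware string comparison functions, such as String.Compare with a CultureInfo argument or a StringComparer created for that culture.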

Tim Robinson answered Oct 23 '22 04:10
