Why isn't `Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x))==x`

Tags:

c#

utf-8

In .NET why isn't it true that:

Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x))

returns the original byte array for an arbitrary byte array x?

It is mentioned in an answer to another question, but the responder doesn't explain why.

PyreneesJim asked Mar 16 '12 15:03


2 Answers

First, as watbywbarif mentioned, you shouldn't compare arrays by using ==. For arrays it compares references, not contents, so it doesn't work here.
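To illustrate the difference, here is a minimal sketch. The class name and the sample bytes are made up for the example; only == and Enumerable.SequenceEqual come from .NET itself.

```csharp
using System;
using System.Linq;

public static class ArrayComparisonDemo
{
    // Element-by-element comparison of two byte arrays.
    public static bool SameContents(byte[] a, byte[] b) => a.SequenceEqual(b);

    public static void Main()
    {
        byte[] a = { 0x68, 0x69 };
        byte[] b = { 0x68, 0x69 };

        // == on arrays is reference equality, so two distinct arrays are never ==.
        Console.WriteLine(a == b);              // False
        // SequenceEqual walks both arrays and compares the elements.
        Console.WriteLine(SameContents(a, b));  // True
    }
}
```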

But even if you compare the arrays correctly (e.g. by using SequenceEqual() or just by looking at them), they aren't always the same. One case where this can occur is if x is not a valid UTF-8 byte sequence.

For example, the 1-byte sequence of 0xFF is not valid UTF-8. So what does Encoding.UTF8.GetString(new byte[] { 0xFF }) return? It's �, U+FFFD, REPLACEMENT CHARACTER. And of course, if you call Encoding.UTF8.GetBytes() on that, it doesn't give you back 0xFF.
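The 0xFF case can be reproduced with a small sketch like the one below. The RoundTrip helper and the class name are hypothetical names chosen for the example; the input { 0x68, 0x69 } is just an arbitrary valid sequence ("hi") for contrast.

```csharp
using System;
using System.Linq;
using System.Text;

public static class InvalidUtf8RoundTrip
{
    // Hypothetical helper: decode bytes to a string and encode them again.
    public static byte[] RoundTrip(byte[] x) =>
        Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x));

    public static void Main()
    {
        byte[] invalid = { 0xFF };

        // The invalid byte is decoded to U+FFFD REPLACEMENT CHARACTER.
        string s = Encoding.UTF8.GetString(invalid);
        Console.WriteLine(s == "\uFFFD");                           // True

        // Encoding U+FFFD yields its own 3-byte UTF-8 form, not 0xFF.
        byte[] back = RoundTrip(invalid);
        Console.WriteLine(BitConverter.ToString(back));             // EF-BF-BD
        Console.WriteLine(invalid.SequenceEqual(back));             // False

        // A valid UTF-8 sequence survives the round trip unchanged.
        byte[] valid = { 0x68, 0x69 };
        Console.WriteLine(valid.SequenceEqual(RoundTrip(valid)));   // True
    }
}
```

If silent replacement is not what you want, constructing the encoding as new UTF8Encoding(false, true) makes GetString throw on invalid bytes instead of substituting U+FFFD.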

svick answered Oct 31 '22 17:10


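Unicode text (and therefore its UTF-8 encoding, specifically) may have different forms for the same character.

GetString and GetBytes do not normalize on their own, but if the string is normalized anywhere between decoding and encoding, the actual bytes may represent a different (canonical) form.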

See also String.Normalize(System.Text.NormalizationForm.FormD)

See also:

  • Can I get a single canonical UTF-8 string from a Unicode string?
  • What does .NET's String.Normalize do?
  • NormalizationForm

Some Unicode sequences are considered equivalent because they represent the same character. For example, the following are considered equivalent because any of these can be used to represent "ắ":

"\u1EAF" 
"\u0103\u0301" 
"\u0061\u0306\u0301" 

However, ordinal, that is, binary, comparisons consider these sequences different because they contain different Unicode code values. Before performing ordinal comparisons, applications must normalize these strings to decompose them into their basic components.
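A rough sketch of what this means at the byte level: the three equivalent strings above encode to three different UTF-8 byte sequences, and only agree after normalization. The class and helper names are invented for the example, and FormC is chosen arbitrarily (FormD would also make them agree, on the decomposed form).

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: UTF-8 bytes of a string as hex, e.g. "E1-BA-AF".
    public static string Utf8Hex(string s) =>
        BitConverter.ToString(Encoding.UTF8.GetBytes(s));

    public static void Main()
    {
        string[] forms = { "\u1EAF", "\u0103\u0301", "\u0061\u0306\u0301" };

        foreach (string s in forms)
        {
            // Raw bytes differ per form (3, 4 and 5 bytes respectively);
            // after composing with FormC all three become E1-BA-AF.
            string raw = Utf8Hex(s);
            string composed = Utf8Hex(s.Normalize(NormalizationForm.FormC));
            Console.WriteLine($"{raw} -> {composed}");
        }
    }
}
```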

That page comes with a nice sample that shows which normalization forms a string is already normalized to.

sehe answered Oct 31 '22 17:10