In .NET, why isn't it true that Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x)) returns the original byte array for an arbitrary byte array x?
It is mentioned in an answer to another question, but the responder doesn't explain why.
First, as watbywbarif mentioned, you shouldn't compare arrays by using ==; that doesn't work, because == on arrays compares references, not contents.
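A minimal sketch of the difference (class and variable names are mine): == on arrays is reference equality, while LINQ's SequenceEqual() compares the elements.

    using System;
    using System.Linq;

    class ArrayCompareDemo
    {
        static void Main()
        {
            byte[] a = { 1, 2, 3 };
            byte[] b = { 1, 2, 3 };

            // == on arrays compares references, so this prints False
            // even though the contents are identical.
            Console.WriteLine(a == b);              // False

            // SequenceEqual() compares the elements, so this prints True.
            Console.WriteLine(a.SequenceEqual(b));  // True
        }
    }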
But even if you compare the arrays correctly (e.g. by using SequenceEqual() or just by looking at them), they aren't always the same. One case where this can occur is if x is not a valid UTF-8 byte sequence.
For example, the 1-byte sequence 0xFF is not valid UTF-8. So what does Encoding.UTF8.GetString(new byte[] { 0xFF }) return? It's �, U+FFFD REPLACEMENT CHARACTER. And of course, if you call Encoding.UTF8.GetBytes() on that, it doesn't give you back 0xFF.
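Here is a small sketch of that round trip; the values in the comments are what I'd expect from the default replacement fallback:

    using System;
    using System.Linq;
    using System.Text;

    class RoundTripDemo
    {
        static void Main()
        {
            byte[] x = { 0xFF };                      // not valid UTF-8

            // The default decoder fallback replaces invalid bytes with U+FFFD.
            string s = Encoding.UTF8.GetString(x);
            Console.WriteLine((int)s[0]);             // 65533, i.e. U+FFFD

            // Encoding U+FFFD produces its own UTF-8 bytes, not the original 0xFF.
            byte[] roundTripped = Encoding.UTF8.GetBytes(s);
            Console.WriteLine(BitConverter.ToString(roundTripped));  // EF-BF-BD

            Console.WriteLine(x.SequenceEqual(roundTripped));        // False
        }
    }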
Unicode (and therefore UTF-8) can represent the same character in more than one form. So when you convert to a string and back, the actual bytes may represent a different (canonical) form.
See also String.Normalize(System.Text.NormalizationForm.FormD)
See also:
Some Unicode sequences are considered equivalent because they represent the same character. For example, the following are considered equivalent because any of these can be used to represent "ắ":
"\u1EAF" "\u0103\u0301" "\u0061\u0306\u0301"
However, ordinal, that is, binary, comparisons consider these sequences different because they contain different Unicode code values. Before performing ordinal comparisons, applications must normalize these strings to decompose them into their basic components.
That page comes with a nice sample that shows you which forms are always normalized.
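As a rough sketch of what that normalization looks like in practice (the expected output in the comments follows from the canonical decomposition of U+1EAF):

    using System;
    using System.Text;

    class NormalizationDemo
    {
        static void Main()
        {
            string composed = "\u1EAF";                // ắ as a single code point
            string decomposed = "\u0061\u0306\u0301";  // a + combining breve + combining acute

            // Ordinal comparison sees different code point sequences.
            Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal)); // False

            // After normalizing both to the same form, they compare equal.
            Console.WriteLine(composed.Normalize(NormalizationForm.FormD)
                              == decomposed.Normalize(NormalizationForm.FormD));              // True

            // Their UTF-8 byte representations differ accordingly.
            Console.WriteLine(Encoding.UTF8.GetBytes(composed).Length);    // 3
            Console.WriteLine(Encoding.UTF8.GetBytes(decomposed).Length);  // 5
        }
    }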