In .NET, why isn't it true that Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x)) returns the original byte array for an arbitrary byte array x?
It is mentioned in an answer to another question, but the responder doesn't explain why.
First, as watbywbarif mentioned, you shouldn't compare arrays by using ==; that doesn't work, because == on arrays compares references, not contents.
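A minimal sketch of the difference (class and variable names are mine): == on arrays is reference equality, while LINQ's SequenceEqual() compares the elements.

    using System;
    using System.Linq;

    class ArrayCompareDemo
    {
        static void Main()
        {
            byte[] a = { 1, 2, 3 };
            byte[] b = { 1, 2, 3 };

            // == on arrays compares references, so this prints False
            // even though the contents are identical.
            Console.WriteLine(a == b);              // False

            // SequenceEqual() compares the elements, so this prints True.
            Console.WriteLine(a.SequenceEqual(b));  // True
        }
    }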
But even if you compare the arrays correctly (e.g. by using SequenceEqual() or just by looking at them), they aren't always the same. One case where this can occur is if x is not a valid UTF-8 byte sequence.
For example, the 1-byte sequence 0xFF is not valid UTF-8. So what does Encoding.UTF8.GetString(new byte[] { 0xFF }) return? It's �, U+FFFD REPLACEMENT CHARACTER. And of course, if you call Encoding.UTF8.GetBytes() on that, it doesn't give you back 0xFF.
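Here is a small sketch of that round trip; the values in the comments are what I'd expect from the default replacement fallback:

    using System;
    using System.Linq;
    using System.Text;

    class RoundTripDemo
    {
        static void Main()
        {
            byte[] x = { 0xFF };                      // not valid UTF-8

            // The default decoder fallback replaces invalid bytes with U+FFFD.
            string s = Encoding.UTF8.GetString(x);
            Console.WriteLine((int)s[0]);             // 65533, i.e. U+FFFD

            // Encoding U+FFFD produces its own UTF-8 bytes, not the original 0xFF.
            byte[] roundTripped = Encoding.UTF8.GetBytes(s);
            Console.WriteLine(BitConverter.ToString(roundTripped));  // EF-BF-BD

            Console.WriteLine(x.SequenceEqual(roundTripped));        // False
        }
    }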
Unicode (and therefore UTF-8) can represent the same character in more than one form. So when you convert to a string and back, the actual bytes may represent a different (canonical) form.
See also String.Normalize(System.Text.NormalizationForm.FormD)
See also:
Some Unicode sequences are considered equivalent because they represent the same character. For example, the following are considered equivalent because any of these can be used to represent "ắ":
"\u1EAF" "\u0103\u0301" "\u0061\u0306\u0301"
However, ordinal, that is, binary, comparisons consider these sequences different because they contain different Unicode code values. Before performing ordinal comparisons, applications must normalize these strings to decompose them into their basic components.
That page comes with a nice sample that shows you which forms are always normalized.
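As a rough sketch of what that normalization looks like in practice (the expected output in the comments follows from the canonical decomposition of U+1EAF):

    using System;
    using System.Text;

    class NormalizationDemo
    {
        static void Main()
        {
            string composed = "\u1EAF";                // ắ as a single code point
            string decomposed = "\u0061\u0306\u0301";  // a + combining breve + combining acute

            // Ordinal comparison sees different code point sequences.
            Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal)); // False

            // After normalizing both to the same form, they compare equal.
            Console.WriteLine(composed.Normalize(NormalizationForm.FormD)
                              == decomposed.Normalize(NormalizationForm.FormD));              // True

            // Their UTF-8 byte representations differ accordingly.
            Console.WriteLine(Encoding.UTF8.GetBytes(composed).Length);    // 3
            Console.WriteLine(Encoding.UTF8.GetBytes(decomposed).Length);  // 5
        }
    }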