I am probably missing something, but I do not understand why Encoding.UTF8.GetString and Encoding.UTF8.GetBytes do not work as inverse transformations of each other.
In the following example, myOriginalBytes and asBytes are not equal; even their lengths differ. Could anyone explain what I am missing?
byte[] myOriginalBytes = GetRandomByteArray();
var asString = Encoding.UTF8.GetString(myOriginalBytes);
var asBytes = Encoding.UTF8.GetBytes(asString);
They're inverses if you start with a valid UTF-8 byte sequence, but they're not if you just start with an arbitrary byte sequence.
Let's take a concrete and very simple example: a single byte, 0xff. That's not a valid UTF-8 encoding of any text. So if you have:
byte[] bytes = { 0xff };
string text = Encoding.UTF8.GetString(bytes);
... you'll end up with text being a single character, U+FFFD, the "Unicode replacement character", which is used to indicate that there was an error decoding the binary data. You'll end up with that replacement character for any invalid sequence - so you'd get the same text if you started with 0x80, for example. Clearly, if multiple binary inputs are decoded to the same textual output, it can't possibly be a fully reversible transform.
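To make that concrete, here is a minimal sketch (the class and method names are made up for illustration). It shows two different invalid inputs decoding to the same string, and the one-byte input coming back as the three-byte UTF-8 encoding of U+FFFD, which is why the lengths differ. It also shows one possible way to detect invalid input up front, assuming you are happy to construct a UTF8Encoding with throwOnInvalidBytes set to true:

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Decodes the bytes as UTF-8 and re-encodes the resulting string.
    public static byte[] RoundTrip(byte[] input)
    {
        string text = Encoding.UTF8.GetString(input);
        return Encoding.UTF8.GetBytes(text);
    }

    // Returns false if the input is not a valid UTF-8 byte sequence.
    // A strict UTF8Encoding throws instead of substituting U+FFFD.
    public static bool IsValidUtf8(byte[] input)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(input);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string a = Encoding.UTF8.GetString(new byte[] { 0xff });
        string b = Encoding.UTF8.GetString(new byte[] { 0x80 });

        Console.WriteLine(a == "\uFFFD"); // True
        Console.WriteLine(a == b);        // True: two inputs, one output

        byte[] back = RoundTrip(new byte[] { 0xff });
        Console.WriteLine(BitConverter.ToString(back)); // EF-BF-BD

        Console.WriteLine(IsValidUtf8(new byte[] { 0xff }));       // False
        Console.WriteLine(IsValidUtf8(new byte[] { 0x68, 0x69 })); // True ("hi")
    }
}
```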
If you have arbitrary binary data, you should not use Encoding to get text from it - you should use Convert.ToBase64String or maybe hex. Encoding is for data that is naturally textual.
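As a rough sketch of the Base64 route (the class and method names are hypothetical, and the sample bytes are arbitrary), the round trip through Convert.ToBase64String and Convert.FromBase64String gives back exactly the bytes you started with, whatever they are:

```csharp
using System;
using System.Linq;

public static class BinaryAsText
{
    // Converts arbitrary bytes to Base64 text and back again.
    public static byte[] RoundTripBase64(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text);
    }

    public static void Main()
    {
        // Deliberately includes bytes that are not valid UTF-8.
        byte[] original = { 0xff, 0x80, 0x00, 0x41 };

        byte[] restored = RoundTripBase64(original);
        Console.WriteLine(original.SequenceEqual(restored)); // True

        // A hex rendering is another option for display purposes.
        Console.WriteLine(BitConverter.ToString(original)); // FF-80-00-41
    }
}
```

The trade-off is size: Base64 text is roughly a third longer than the binary data, and hex is twice as long.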
If you go in the opposite direction, like this:
string text = GetRandomText();
byte[] bytes = Encoding.UTF8.GetBytes(text);
string text2 = Encoding.UTF8.GetString(bytes);
... I'd expect text2 to be equal to text, with the exception of odd situations where you've got invalid text to start with, e.g. with "half" a surrogate pair.
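Here is a small sketch of both cases (names are again made up for illustration). It assumes the default Encoding.UTF8 instance, which substitutes U+FFFD for an unpaired surrogate rather than throwing:

```csharp
using System;
using System.Text;

public static class LoneSurrogateDemo
{
    // Encodes the string as UTF-8 and decodes the resulting bytes.
    public static string RoundTrip(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // Valid text, including a character outside the BMP
        // (stored as a proper surrogate pair in the string).
        string valid = "h\u00E9llo \U0001F600";
        Console.WriteLine(RoundTrip(valid) == valid); // True

        // A high surrogate with no matching low surrogate: invalid text.
        string invalid = "\uD83D";
        Console.WriteLine(RoundTrip(invalid) == invalid);  // False
        Console.WriteLine(RoundTrip(invalid) == "\uFFFD"); // True
    }
}
```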