
What is the reason that Encoding.UTF8.GetString and Encoding.UTF8.GetBytes are not inverse of each other?

Tags: c#, .net, utf-8

I am probably missing something, but I do not understand why Encoding.UTF8.GetString and Encoding.UTF8.GetBytes do not work as inverse transformations of each other.

In the following example, myOriginalBytes and asBytes are not equal; even their lengths differ. Could anyone explain what I am missing?

byte[] myOriginalBytes = GetRandomByteArray();
var asString = Encoding.UTF8.GetString(myOriginalBytes);
var asBytes = Encoding.UTF8.GetBytes(asString);
g.pickardou asked Jul 31 '17 07:07



1 Answer

They're inverses if you start with a valid UTF-8 byte sequence, but they're not if you just start with an arbitrary byte sequence.

Let's take a concrete and very simple example: a single byte, 0xff. That's not the valid UTF-8 encoding for any text. So if you have:

byte[] bytes = { 0xff };
string text = Encoding.UTF8.GetString(bytes);

... you'll end up with text being a single character, U+FFFD, the "Unicode replacement character" which is used to indicate that there was an error decoding the binary data. You'll end up with that replacement character for any invalid sequence - so you'd get the same text if you started with 0x80 for example. Clearly if multiple binary inputs are decoded to the same textual output, it can't possibly be a fully-reversible transform.
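To make the collision concrete, here is a minimal sketch. It assumes the default Encoding.UTF8 instance, which substitutes U+FFFD for invalid input rather than throwing; the class name ReplacementDemo is purely illustrative. It also shows why the lengths differ in the question: one invalid byte comes back as the three-byte encoding of U+FFFD.

```csharp
using System;
using System.Text;

class ReplacementDemo
{
    static void Main()
    {
        // Two different invalid single-byte inputs...
        string fromFf = Encoding.UTF8.GetString(new byte[] { 0xff });
        string from80 = Encoding.UTF8.GetString(new byte[] { 0x80 });

        // ...decode to the same replacement character.
        Console.WriteLine(fromFf == "\uFFFD"); // True
        Console.WriteLine(fromFf == from80);   // True

        // Re-encoding yields the UTF-8 form of U+FFFD, not the original byte.
        byte[] roundTripped = Encoding.UTF8.GetBytes(fromFf);
        Console.WriteLine(BitConverter.ToString(roundTripped)); // EF-BF-BD

        // A strict UTF8Encoding reports invalid input instead of replacing it.
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(new byte[] { 0xff });
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("invalid UTF-8");
        }
    }
}
```

The strict instance is one way to find out whether a byte array is valid UTF-8 before relying on a round trip.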

If you have arbitrary binary data, you should not use Encoding to get text from it - you should use Convert.ToBase64String or maybe hex. Encoding is for data that is naturally textual.
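As a rough sketch of the Base64 route, the following round-trips random bytes through Convert.ToBase64String and Convert.FromBase64String. The class name Base64Demo and the 16-byte buffer are illustrative assumptions; the buffer stands in for GetRandomByteArray() from the question.

```csharp
using System;
using System.Linq;

class Base64Demo
{
    static void Main()
    {
        // Arbitrary binary data (stand-in for GetRandomByteArray()).
        byte[] original = new byte[16];
        new Random().NextBytes(original);

        // Base64 maps every byte sequence to distinct text, so it is reversible.
        string asText = Convert.ToBase64String(original);
        byte[] restored = Convert.FromBase64String(asText);
        Console.WriteLine(original.SequenceEqual(restored)); // True

        // Hex is the other option mentioned above; BitConverter emits "AA-BB-...".
        string hex = BitConverter.ToString(original).Replace("-", "");
        Console.WriteLine(hex.Length == original.Length * 2); // True
    }
}
```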

If you go in the opposite direction, like this:

string text = GetRandomText();
byte[] bytes = Encoding.UTF8.GetBytes(text);
string text2 = Encoding.UTF8.GetString(bytes);

... I'd expect text2 to be equal to text with the exception of odd situations where you've got invalid text to start with, e.g. with "half" a surrogate pair.
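The following sketch illustrates both cases, again assuming the default Encoding.UTF8 instance with its replacement fallback. The sample strings and the class name SurrogateDemo are made up for illustration.

```csharp
using System;
using System.Text;

class SurrogateDemo
{
    static void Main()
    {
        // Well-formed text, including a character outside the BMP, round-trips.
        string valid = "h\u00E9llo \U0001F600";
        string validBack = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(valid));
        Console.WriteLine(validBack == valid); // True

        // A lone high surrogate is not valid text on its own.
        string lone = ((char)0xD800).ToString();
        byte[] bytes = Encoding.UTF8.GetBytes(lone);

        // The encoder substitutes U+FFFD, so the original code unit is lost.
        Console.WriteLine(BitConverter.ToString(bytes)); // EF-BF-BD
        string text2 = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(text2 == lone);     // False
        Console.WriteLine(text2 == "\uFFFD"); // True
    }
}
```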

Jon Skeet answered Oct 07 '22 16:10