C#: bytes to UTF-8 string conversion. Why doesn't it work?

Question

There is a Chinese character 𤭢 which is represented in UTF-8 as F0 A4 AD A2. This character is described here: http://en.wikipedia.org/wiki/UTF-8

𤭢 U+24B62 F0 A4 AD A2

When I run this code in C# ...

byte[] data = { 0xF0, 0xA4, 0xAD, 0xA2 };
string abc = Encoding.UTF8.GetString(data);
Console.WriteLine("Test: description = {0}", abc);

... I redirect the output to a text file and then open it with notepad.exe, choosing UTF-8 encoding. I expect to get 𤭢 in the output, but I get two question marks (??) instead.

The byte sequence is right. It works in Perl:

print "\xF0\xA4\xAD\xA2";

In the output, I get 𤭢

So my question is: why do I get "??" instead of "𤭢" in C#?

P.S. There is nothing special about this character: I get the same result for any multi-byte character (2, 3 or 4 bytes long).

Sasha · Accepted Answer

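By default the console does not output arbitrary Unicode characters. It encodes text with the active console code page, and any character that code page cannot represent is replaced with "?". To make it output Unicode, set: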

Console.OutputEncoding = System.Text.Encoding.Unicode;

before writing to it.
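Note that Encoding.Unicode is UTF-16, so a file produced by redirecting that output would not open correctly as UTF-8. A minimal sketch of the UTF-8 variant, reusing the bytes from the question, might look like the following. The class and method names are made up for illustration, and whether the character renders in a console window still depends on the console and its font.

```csharp
using System;
using System.Text;

static class ConsoleUtf8Demo
{
    // Decodes UTF-8 bytes into a .NET string (UTF-16 internally).
    public static string Decode(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    static void Main()
    {
        // Ask the console to encode output as UTF-8 instead of the
        // default code page; this also affects redirected output.
        Console.OutputEncoding = Encoding.UTF8;

        byte[] data = { 0xF0, 0xA4, 0xAD, 0xA2 };
        Console.WriteLine("Test: description = {0}", Decode(data));
    }
}
```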

Even then it may fail on many systems, because the Windows command prompt itself has only limited Unicode support.

So, for testing purposes, it is better to write the output to a file.
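Another way to test, independent of any console or file, is to inspect the decoded string directly. The sketch below (helper name CodeUnitsHex is hypothetical) shows that GetString decodes the four bytes correctly into a UTF-16 surrogate pair. The string holds two code units, which is consistent with the two question marks seen in the output, so the problem appears to lie in how the text is written out rather than in the decoding.

```csharp
using System;
using System.Linq;
using System.Text;

static class DecodeCheck
{
    // Formats each UTF-16 code unit of s as four hex digits.
    public static string CodeUnitsHex(string s)
    {
        return string.Join(" ", s.Select(ch => ((int)ch).ToString("X4")));
    }

    static void Main()
    {
        byte[] data = { 0xF0, 0xA4, 0xAD, 0xA2 };
        string abc = Encoding.UTF8.GetString(data);

        Console.WriteLine(abc.Length);                                // 2
        Console.WriteLine(CodeUnitsHex(abc));                         // D852 DF62
        Console.WriteLine(char.ConvertToUtf32(abc, 0).ToString("X")); // 24B62
    }
}
```

All three lines print plain ASCII, so they are readable regardless of the console encoding.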

Jakob Christensen · Answer

You need to write to a file using UTF8. The code below shows how you may do it. When opening the resulting file in Notepad, the character 𤭢 is shown correctly:

string c = "𤭢";
var bytes = Encoding.UTF8.GetBytes(c);
var cBack = Encoding.UTF8.GetString(bytes);
using (var writer = new StreamWriter(@"c:\temp\char.txt", false, Encoding.UTF8))
{
    writer.WriteLine(cBack);
}
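To check what actually lands on disk, the bytes can be read back. The sketch below is an illustration only: the class name, the helper WriteAndReadBack, and the temp-directory path are assumptions, not part of the answer. It also uses new UTF8Encoding(false) rather than Encoding.UTF8, because a StreamWriter created with Encoding.UTF8 starts a new file with a byte order mark (EF BB BF), which would otherwise appear in front of the character's bytes.

```csharp
using System;
using System.IO;
using System.Text;

static class FileRoundTrip
{
    // Writes text to path as UTF-8 without a BOM and returns the bytes on disk.
    public static byte[] WriteAndReadBack(string path, string text)
    {
        File.WriteAllText(path, text, new UTF8Encoding(false));
        return File.ReadAllBytes(path);
    }

    static void Main()
    {
        // Hypothetical location; any writable path works.
        string path = Path.Combine(Path.GetTempPath(), "char.txt");
        string c = Encoding.UTF8.GetString(new byte[] { 0xF0, 0xA4, 0xAD, 0xA2 });

        byte[] onDisk = WriteAndReadBack(path, c);
        Console.WriteLine(BitConverter.ToString(onDisk)); // F0-A4-AD-A2
    }
}
```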

Tags:

c#

character-encoding

hex

encoding

utf-8

Racoon
