
C#: bytes to UTF-8 string conversion. Why doesn't it work?

There is a Chinese character 𤭢 which is represented in UTF-8 as F0 A4 AD A2. This character is described here: http://en.wikipedia.org/wiki/UTF-8

𤭢 U+24B62 F0 A4 AD A2

When I run this code in C# ...

byte[] data = { 0xF0, 0xA4, 0xAD, 0xA2 };
string abc = Encoding.UTF8.GetString(data);
Console.WriteLine("Test: description = {0}", abc);

... I redirect the output to a text file and then open it with notepad.exe, choosing UTF-8 encoding. I expect to get 𤭢 in the output, but I get two question marks (??) instead.

The byte sequence is right. It works in Perl:

print "\xF0\xA4\xAD\xA2";

In the output, I get 𤭢

So my question is: why do I get "??" instead of "𤭢" in C#?

P.S. There is nothing special about this character: I get the same result for any multi-byte character (2, 3 or 4 bytes long).

asked Mar 04 '13 16:03 by Racoon


2 Answers

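By default the console does not write Unicode. It encodes output with the console's active code page, and any character that code page cannot represent comes out as a question mark. To make it write Unicode, set:

Console.OutputEncoding = System.Text.Encoding.Unicode;

before writing to it. Note that Encoding.Unicode is UTF-16; use System.Text.Encoding.UTF8 if the redirected output should be read as UTF-8.

Even then, the character may not display correctly in the Windows command prompt window itself, which has only limited Unicode support.

So, for testing purposes, it is better to write the output to a file.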
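
As a hedged sketch of the advice above, the program below takes the byte array from the question, confirms that decoding itself succeeds, and sets the console encoding before anything is written. It uses Encoding.UTF8 rather than Encoding.Unicode on the assumption that the redirected file should contain UTF-8 bytes, as in the question. The class and method names are illustrative only.

```csharp
using System;
using System.Text;

static class ConsoleUtf8Sketch
{
    // Decodes the four bytes from the question. The name DecodeSample is
    // hypothetical and exists only for this illustration.
    public static string DecodeSample()
    {
        byte[] data = { 0xF0, 0xA4, 0xAD, 0xA2 };
        return Encoding.UTF8.GetString(data);
    }

    static void Main()
    {
        // Set the encoding before the first write so that everything sent
        // to stdout (including a redirected file) is encoded as UTF-8.
        Console.OutputEncoding = Encoding.UTF8;

        string abc = DecodeSample();

        // U+24B62 lies outside the Basic Multilingual Plane, so the string
        // stores it as a surrogate pair: two chars, one code point.
        Console.WriteLine("chars = {0}, code point = U+{1:X}",
            abc.Length, char.ConvertToUtf32(abc, 0));

        Console.WriteLine("Test: description = {0}", abc);
    }
}
```

If the first line reports two chars and code point U+24B62, the conversion worked, and any question marks come from the output encoding rather than from GetString.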

answered Sep 30 '22 21:09 by Sasha


You need to write to the file using UTF-8 encoding. The code below shows one way to do it. When the resulting file is opened in Notepad, the character 𤭢 is shown correctly:

string c = "𤭢";
var bytes = Encoding.UTF8.GetBytes(c);
var cBack = Encoding.UTF8.GetString(bytes);
using (var writer = new StreamWriter(@"c:\temp\char.txt", false, Encoding.UTF8))
{
    writer.WriteLine(cBack);
}
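
A variation on the code above, offered as a sketch rather than a definitive fix: it reads the file back as raw bytes, so the result can be checked without relying on Notepad. The temp-directory path, the class and method names, and the choice to omit the byte order mark are assumptions made for illustration. StreamWriter with Encoding.UTF8, as used above, writes a BOM at the start of a new file.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8FileSketch
{
    // Writes text as UTF-8 without a byte order mark and returns the raw
    // bytes found on disk. The name WriteAndReadBack is hypothetical.
    public static byte[] WriteAndReadBack(string path, string text)
    {
        var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        using (var writer = new StreamWriter(path, false, utf8NoBom))
        {
            writer.Write(text);
        }
        return File.ReadAllBytes(path);
    }

    static void Main()
    {
        // Assumed location: a file in the current user's temp directory.
        string path = Path.Combine(Path.GetTempPath(), "char.txt");

        // Build the character from its code point, U+24B62.
        string c = char.ConvertFromUtf32(0x24B62);

        byte[] onDisk = WriteAndReadBack(path, c);
        Console.WriteLine(BitConverter.ToString(onDisk)); // F0-A4-AD-A2
    }
}
```

The hex dump contains only ASCII characters, so it reads the same whatever encoding the console uses.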

answered Sep 30 '22 21:09 by Jakob Christensen