There is a Chinese character, 𤭢, which is represented in UTF-8 as F0 A4 AD A2. The character is listed here: http://en.wikipedia.org/wiki/UTF-8
𤭢 U+24B62 F0 A4 AD A2
When I run this code in C# ...
byte[] data = { 0xF0, 0xA4, 0xAD, 0xA2 };
string abc = Encoding.UTF8.GetString(data);
Console.WriteLine("Test: description = {0}", abc);
... I redirect the output to a text file and then open it with notepad.exe, choosing UTF-8 encoding. I expect to see 𤭢 in the output, but I get two question marks (??) instead.
The byte sequence is right. It works in Perl:
print "\xF0\xA4\xAD\xA2";
In the output, I get 𤭢
So my question is: why do I get "??" instead of "𤭢" in C#?
P.S. There is nothing special about this character: I get the same result for any multi-byte character (2, 3, or 4 bytes long).
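By default, the console does not use a Unicode output encoding. On Windows it uses the system's OEM code page, so characters outside that code page are written as question marks. To make it output Unicode, use:
Console.OutputEncoding = System.Text.Encoding.Unicode;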
before writing to it.
Even then, the character may not display correctly on many systems, because the Windows command prompt itself has only limited Unicode support.
So, for testing purposes, it is better to write the output to a file.
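For the scenario in the question, where console output is redirected to a file that is then opened as UTF-8, a minimal sketch might look like the following. It assumes C# 9 or later (top-level statements) and uses Encoding.UTF8 rather than Encoding.Unicode so that the redirected bytes match the encoding chosen in Notepad. That substitution is an assumption, not something the answer above prescribes.

```csharp
using System;
using System.Text;

// The UTF-8 bytes from the question (U+24B62).
byte[] data = { 0xF0, 0xA4, 0xAD, 0xA2 };
string text = Encoding.UTF8.GetString(data);

// Assumption: UTF-8 is selected so that output redirected to a file
// can be opened as UTF-8. Set this before the first write.
Console.OutputEncoding = Encoding.UTF8;

Console.WriteLine("Test: description = {0}", text);
```

Whether the character is drawn correctly in the console window itself still depends on the console host and its font; the sketch only addresses the bytes that are written.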
You need to write to a file using UTF-8. The code below shows one way to do it. When the resulting file is opened in Notepad, the character 𤭢 is shown correctly:
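using System.IO;
using System.Text;

string c = "𤭢";
// Round-trip through UTF-8 to confirm the bytes decode back to the same string.
var bytes = Encoding.UTF8.GetBytes(c);
var cBack = Encoding.UTF8.GetString(bytes);
// Write the file as UTF-8; the c:\temp directory must already exist.
using (var writer = new StreamWriter(@"c:\temp\char.txt", false, Encoding.UTF8))
{
    writer.WriteLine(cBack);
}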
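To check what actually ends up on disk, the file can be read back as raw bytes. The following is a hedged sketch rather than part of the answer above: it assumes C# 9 or later (top-level statements), writes to a hypothetical file in the system temp directory instead of c:\temp\char.txt, and uses UTF8Encoding(false) to omit the byte-order mark so that only the character's own bytes are written.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical location: a temp file rather than c:\temp\char.txt.
string path = Path.Combine(Path.GetTempPath(), "char.txt");

// Build the character from its code point, U+24B62.
string c = char.ConvertFromUtf32(0x24B62);

// UTF8Encoding(false) writes no byte-order mark.
File.WriteAllText(path, c, new UTF8Encoding(false));

// Read the raw bytes back and format them as hex.
byte[] written = File.ReadAllBytes(path);
string hex = BitConverter.ToString(written);

// Decode the file again to confirm the round trip.
string roundTrip = File.ReadAllText(path, Encoding.UTF8);

Console.WriteLine("Bytes on disk: {0}", hex);
Console.WriteLine("UTF-16 code units in the string: {0}", c.Length);
Console.WriteLine("Round trip matches: {0}", roundTrip == c);
```

The string length is 2 because .NET strings are UTF-16 and this character lies outside the Basic Multilingual Plane, so it is stored as a surrogate pair. That is consistent with the two question marks seen in the question, although the sketch does not demonstrate the console behaviour itself.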