
Can someone explain Encoding.Unicode.GetBytes("hello") for me?

Tags: unicode

My code:

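        // Requires: using System; using System.Text;
        string input1 = Console.ReadLine();

        Console.WriteLine("byte output");

        // Encode the input string with Encoding.Unicode.
        byte[] bInput1 = Encoding.Unicode.GetBytes(input1);

        // Print each byte's index and its decimal value.
        for (int x = 0; x < bInput1.Length; x++)
            Console.WriteLine("{0} = {1}", x, bInput1[x]);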

For the input "hello", the byte values printed are, in order:

104 0 101 0 108 0 108 0 111 0

Is there a reference to the character map where I can make sense of this?


2 Answers

You should read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" at http://www.joelonsoftware.com/articles/Unicode.html

You can find a list of all Unicode characters at http://www.unicode.org, but don't expect to be able to read the files there without first learning a lot about text encoding issues.

Nir, answered Apr 12 '26 12:04


At http://www.unicode.org/charts/ you can find all the Unicode code charts. http://www.unicode.org/charts/PDF/U0000.pdf shows that the code point for 'h' is U+0068. (Another great tool for viewing this data is BabelMap.)
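If you want to check the chart values from code, here is a minimal sketch. The class name CodePointDemo and the helper ToCodePoint are made up for illustration. It relies on the fact that a C# char is a UTF-16 code unit, which equals the code point for Basic Multilingual Plane characters such as the letters in "hello" (it would not hold for characters that need surrogate pairs).

```csharp
using System;

public static class CodePointDemo
{
    // Hypothetical helper: formats a char's UTF-16 code unit as U+XXXX.
    // For BMP characters this is the same number as the Unicode code point.
    public static string ToCodePoint(char c) => $"U+{(int)c:X4}";

    public static void Main()
    {
        foreach (char c in "hello")
        {
            // Prints e.g. "h -> U+0068 (decimal 104)"
            Console.WriteLine("{0} -> {1} (decimal {2})", c, ToCodePoint(c), (int)c);
        }
    }
}
```

The decimal values (104, 101, 108, 108, 111) are the non-zero numbers in the question's output.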

The exact details of UTF-16 encoding can be found at http://unicode.org/faq/utf_bom.html#6 and http://www.ietf.org/rfc/rfc2781.txt. In short, U+0068 is encoded in UTF-16LE (which is what .NET's Encoding.Unicode produces) as 0x68 0x00: the low byte first, then the high byte. In decimal, these are the first two bytes you see: 104 0.

The other characters are encoded similarly.
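To see where the zeros come from, the byte layout can be reproduced by hand. This is only a sketch: Utf16LeDemo and EncodeUtf16Le are names invented for this example, not part of .NET. It splits each 16-bit code unit into a low byte and a high byte and compares the result with Encoding.Unicode. Encoding.BigEndianUnicode is included to show the opposite byte order.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf16LeDemo
{
    // Hypothetical hand-rolled UTF-16LE encoder: for each 16-bit code unit,
    // emit the low byte first, then the high byte.
    public static byte[] EncodeUtf16Le(string text)
    {
        var bytes = new byte[text.Length * 2];
        for (int i = 0; i < text.Length; i++)
        {
            bytes[2 * i] = (byte)(text[i] & 0xFF);    // low byte (0x68 for 'h')
            bytes[2 * i + 1] = (byte)(text[i] >> 8);  // high byte (0x00 for 'h')
        }
        return bytes;
    }

    public static void Main()
    {
        byte[] manual = EncodeUtf16Le("hello");
        byte[] builtIn = Encoding.Unicode.GetBytes("hello");

        Console.WriteLine(string.Join(" ", manual));       // 104 0 101 0 108 0 108 0 111 0
        Console.WriteLine(manual.SequenceEqual(builtIn));  // True

        // Big-endian UTF-16 puts the high byte first instead.
        Console.WriteLine(string.Join(" ", Encoding.BigEndianUnicode.GetBytes("hello")));
        // 0 104 0 101 0 108 0 108 0 111
    }
}
```

Every letter in "hello" has a code point below 256, so the high byte of each pair is 0. A character such as U+20AC (the euro sign) would produce two non-zero bytes.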

Finally, apart from the Unicode Standard itself, the Unicode Glossary is a great reference when trying to understand the various Unicode specifications.

Bradley Grainger, answered Apr 12 '26 13:04