
Char size in .NET is not as expected?

According to MSDN, the size of char is 2 bytes:

sizeof(char)  // 2

A test:

char[] c = new char[1] {'a'};

Encoding.UTF8.GetByteCount(c)  // 1 ?

Why is the value 1?

(Of course, if c holds a non-ASCII character such as 'ש', it does show 2, as it should.)

Is 'a' not a .NET char?

Royi Namir asked Nov 29 '22 02:11



1 Answer

It's because 'a' only takes one byte to encode in UTF-8.

Encoding.UTF8.GetByteCount(c) will tell you how many bytes it takes to encode the given array of characters in UTF-8. See the documentation for Encoding.GetByteCount for more details. That's entirely separate from how wide the char type is internally in .NET.
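To make the distinction concrete, here is a minimal sketch that compares the in-memory width of char with the encoded size of the same one-element array under two encodings. The class name CharSizeDemo and the helper EncodedSize are made up for illustration; only sizeof, Encoding.UTF8, Encoding.Unicode and GetByteCount come from .NET itself.

```csharp
using System;
using System.Text;

static class CharSizeDemo
{
    // Hypothetical helper: bytes needed to encode the characters in the given encoding.
    public static int EncodedSize(char[] chars, Encoding encoding) =>
        encoding.GetByteCount(chars);

    static void Main()
    {
        char[] c = { 'a' };

        // Width of the char type itself: one UTF-16 code unit.
        Console.WriteLine(sizeof(char));                      // 2

        // Size of 'a' once encoded as UTF-8.
        Console.WriteLine(EncodedSize(c, Encoding.UTF8));     // 1

        // Size of 'a' once encoded as UTF-16 (Encoding.Unicode), matching sizeof(char).
        Console.WriteLine(EncodedSize(c, Encoding.Unicode));  // 2
    }
}
```

The value from sizeof(char) never changes, while the GetByteCount result depends entirely on which encoding you ask about.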

Each character with a code point below 128 (i.e. U+0000 to U+007F) takes a single byte to encode in UTF-8.

Other characters take 2, 3 or even 4 bytes in UTF-8. (Values over U+1FFFFF would take 5 or 6 bytes to encode under the original UTF-8 scheme, but they're not part of Unicode, which currently ends at U+10FFFF, and probably never will be.)
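As a rough illustration of those ranges, the sketch below asks Encoding.UTF8 for the byte count of one sample code point from each width class. Utf8WidthDemo and Utf8Length are hypothetical names, and the sample code points are just my picks; char.ConvertFromUtf32 is the real .NET call, and it throws for surrogate values or anything above U+10FFFF.

```csharp
using System;
using System.Text;

static class Utf8WidthDemo
{
    // Hypothetical helper: number of UTF-8 bytes for a single Unicode code point.
    // Assumes a valid scalar value (not a surrogate, not above U+10FFFF).
    public static int Utf8Length(int codePoint) =>
        Encoding.UTF8.GetByteCount(char.ConvertFromUtf32(codePoint));

    static void Main()
    {
        Console.WriteLine(Utf8Length(0x0061));   // 'a' (U+0061)   -> 1
        Console.WriteLine(Utf8Length(0x05E9));   // 'ש' (U+05E9)   -> 2
        Console.WriteLine(Utf8Length(0x20AC));   // '€' (U+20AC)   -> 3
        Console.WriteLine(Utf8Length(0x1F600));  // emoji (U+1F600) -> 4
    }
}
```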

Note that the characters which take 4 bytes to encode in UTF-8 can't be represented in a single char anyway. A char is a UTF-16 code unit, and any Unicode code point over U+FFFF requires two UTF-16 code units, forming a surrogate pair, to represent it.
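A small sketch of the surrogate-pair point, assuming U+1F600 as an arbitrary example of a code point above U+FFFF. SurrogateDemo and Utf16Units are illustrative names only; the char and Encoding members used are the standard .NET ones.

```csharp
using System;
using System.Text;

static class SurrogateDemo
{
    // Hypothetical helper: how many char values (UTF-16 code units) one code point needs.
    public static int Utf16Units(int codePoint) =>
        char.ConvertFromUtf32(codePoint).Length;

    static void Main()
    {
        string s = char.ConvertFromUtf32(0x1F600);

        Console.WriteLine(Utf16Units(0x0061));                // 1: fits in a single char
        Console.WriteLine(s.Length);                          // 2: needs two chars
        Console.WriteLine(char.IsSurrogatePair(s[0], s[1]));  // True
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));     // 4
    }
}
```

So by the time GetByteCount could report 4 for a single character, you are already holding two char values, not one.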

Jon Skeet answered Dec 05 '22 01:12
