
Can Visual Studio handle U+20000 Unicode as a char? How?

Some Unicode code points need more than two bytes. Can Visual Studio handle these characters? How?

http://www.unicode.org has released the blocks below for CJK. One character can now take more than two bytes.

  • CJK Unified Ideographs Extension B (U+20000 through U+2A6D6)
  • CJK Unified Ideographs Extension C (U+2A700 through U+2B734)
  • CJK Unified Ideographs Extension D (U+2B740 through U+2B81D)
  • CJK Compatibility Ideographs Supplement (U+2F800 through U+2FA1D)

The statement below failed for me in Visual Studio 2012:

char ch = '\u2A6D6';

I have not tried Visual Studio 2013 or Visual Studio 2015 yet.

Herbert Yu asked Feb 24 '26 13:02



2 Answers

This code-point doesn't fit into a char since char only has 16 bits and thus only supports code-points up to 65535. Characters outside the basic multilingual plane (BMP) can be encoded as two UTF-16 code-units in a string using surrogate pairs.

char.ConvertFromUtf32(0x2A6D6) returns a string with two chars, "\uD869\uDED6"


Code points U+10000 to U+10FFFF

Code points from the other planes (called Supplementary Planes) are encoded in UTF-16 by pairs of 16-bit code units called surrogate pairs, by the following scheme:

  • 0x010000 is subtracted from the code point, leaving a 20 bit number in the range 0..0x0FFFFF.
  • The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or lead surrogate, which will be in the range 0xD800..0xDBFF. (Previous versions of the Unicode Standard referred to these as high surrogates.)
  • The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or trail surrogate, which will be in the range 0xDC00..0xDFFF. (Previous versions of the Unicode Standard referred to these as low surrogates.)

From Wikipedia, UTF-16
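
The three steps quoted above can be sketched in C# roughly as follows. This is only an illustration: ToSurrogatePair and SurrogateSketch are hypothetical names, not framework APIs, and in real code char.ConvertFromUtf32 already does this work. The sketch cross-checks itself against that method.

```csharp
using System;

public static class SurrogateSketch
{
    // Hypothetical helper (not part of .NET): splits a supplementary
    // code point (U+10000..U+10FFFF) into lead and trail surrogates.
    public static char[] ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;               // 20-bit value, 0..0xFFFFF
        char lead = (char)(0xD800 + (offset >> 10));    // top ten bits
        char trail = (char)(0xDC00 + (offset & 0x3FF)); // low ten bits
        return new[] { lead, trail };
    }

    public static void Main()
    {
        char[] pair = ToSurrogatePair(0x2A6D6);
        Console.WriteLine("{0:X4} {1:X4}", (int)pair[0], (int)pair[1]); // D869 DED6

        // Cross-check against the framework's own conversion.
        string viaFramework = char.ConvertFromUtf32(0x2A6D6);
        Console.WriteLine(new string(pair) == viaFramework);            // True
    }
}
```

For U+2A6D6 the offset is 0x1A6D6, whose top ten bits are 0x069 and low ten bits are 0x2D6, which gives the same "\uD869\uDED6" pair mentioned in the answer.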

CodesInChaos answered Feb 26 '26 03:02



Visual Studio should be able to handle them fine. Your code, however, is not legal C#. As mentioned by @CodesInChaos, chars in .NET are UTF-16 code units, not Unicode code points. The \uxxxx escape sequence takes exactly 4 hex digits (2 bytes), so '\u2A6D6' is read as \u2A6D followed by a literal 6, which is two characters and too many for a char literal. In C#, you would generally use the \Uxxxxxxxx escape for code points above 0xFFFF, but note that this escape is translated into two surrogate UTF-16 code units (i.e. two .NET chars), so the result can't be assigned to the char data type. If you need to use char, you have to use the surrogates as suggested by @CodesInChaos; otherwise you would generally do the following:

string s = "\U0002A6D6";
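
As a rough illustration of what that string contains, the sketch below inspects it with framework methods (char.IsSurrogatePair and char.ConvertToUtf32). CountCodePoints and SupplementaryStringSketch are hypothetical names introduced only for this example.

```csharp
using System;

public static class SupplementaryStringSketch
{
    // Hypothetical helper: counts code points rather than UTF-16 code units.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            // A surrogate pair occupies two chars but is one code point,
            // so skip the trail surrogate.
            if (char.IsSurrogatePair(text, i))
                i++;
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string s = "\U0002A6D6";
        Console.WriteLine(s.Length);                                // 2
        Console.WriteLine(CountCodePoints(s));                      // 1
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 2A6D6
    }
}
```

The point of the sketch is that s.Length reports UTF-16 code units, so indexing with s[0] yields only the lead surrogate, not the whole character.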

Side note: I wouldn't call the expansion past 2 bytes recent; it happened almost 20 years ago.

DPenner1 answered Feb 26 '26 05:02
