
Return code point of characters in C#

How can I return the Unicode Code Point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of surrogate pairs.

By code point I mean the actual code point according to Unicode, which is different from a code unit: UTF-8 has 8-bit code units, UTF-16 has 16-bit code units, and UTF-32 has 32-bit code units. In UTF-32, the code unit's value equals the code point, after taking endianness into account.

FSm asked Dec 15 '12 16:12


2 Answers

The following code writes the code points of a string input to the console:

    string input = "\uD834\uDD61";

    for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
    {
        var codepoint = char.ConvertToUtf32(input, i);

        Console.WriteLine("U+{0:X4}", codepoint);
    }

Output:

U+1D161

Since strings in .NET are UTF-16 encoded, the char values that make up the string need to be converted to UTF-32 first.
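If the code points are needed as return values rather than console output, the same loop can be wrapped in a method. This is only a minimal sketch, assuming C# 9 top-level statements; `FormatCodePoints` is a hypothetical helper name, not a framework API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns one "U+XXXX" string per code point in the input.
static List<string> FormatCodePoints(string input)
{
    var result = new List<string>();

    // Advance by 2 chars over a surrogate pair, otherwise by 1.
    for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
    {
        // ConvertToUtf32 combines a surrogate pair into a single code point
        // and throws ArgumentException on an unpaired surrogate.
        int codePoint = char.ConvertToUtf32(input, i);
        result.Add($"U+{codePoint:X4}");
    }

    return result;
}

// Prints: U+0041 U+1D161
Console.WriteLine(string.Join(" ", FormatCodePoints("A\uD834\uDD61")));
```

Note that the sketch does not try to recover from malformed input: a string containing a lone surrogate makes `char.ConvertToUtf32` throw, so callers that accept arbitrary text may want to catch that exception.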

dtb answered Sep 18 '22 01:09


Easy, since a char in C# is actually a UTF-16 code unit:

    char x = 'A';
    Console.WriteLine("U+{0:X4}", (int)x);

To address the comments: a char in C# is a 16-bit number and holds a UTF-16 code unit. Code points above the 16-bit range (beyond U+FFFF) cannot be represented in a single C# char. Characters in C# are not variable width. A string, however, can have two chars following each other, each being a code unit, that together form a surrogate pair encoding one code point. If you have a string input with characters above the 16-bit range, you can use Char.IsSurrogatePair and Char.ConvertToUtf32, as suggested in another answer:

    string input = ....
    for (int i = 0; i < input.Length; i += Char.IsSurrogatePair(input, i) ? 2 : 1)
    {
        int x = Char.ConvertToUtf32(input, i);
        Console.WriteLine("U+{0:X4}", x);
    }
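
To make the relationship between the two code units and the resulting code point concrete, the combination that Char.ConvertToUtf32 performs can be written out by hand. This is a sketch for illustration only, assuming C# 9 top-level statements; `CombineSurrogates` is a hypothetical name, and in real code the built-in method is the better choice.

```csharp
using System;

// Hypothetical helper: combines a UTF-16 surrogate pair into one code point.
static int CombineSurrogates(char high, char low)
{
    if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
        throw new ArgumentException("Expected a high surrogate followed by a low surrogate.");

    // Each surrogate carries 10 bits of payload; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

int manual = CombineSurrogates('\uD834', '\uDD61');
int builtIn = char.ConvertToUtf32('\uD834', '\uDD61');

// Prints: U+1D161 U+1D161
Console.WriteLine("U+{0:X4} U+{1:X4}", manual, builtIn);
```

Both values agree, which shows that the two 16-bit code units only have meaning as a code point once they are combined.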

driis answered Sep 19 '22 01:09