How can I return the Unicode Code Point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of surrogate pairs.
By code point I mean the actual code point according to Unicode, which is different from a code unit: UTF-8 has 8-bit code units, UTF-16 has 16-bit code units and UTF-32 has 32-bit code units. In the latter case the value is equal to the code point, after taking endianness into account.
The following code writes the code points of a string input to the console:
    string input = "\uD834\uDD61";

    // Advance by 2 chars over a surrogate pair, by 1 otherwise.
    for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
    {
        var codepoint = char.ConvertToUtf32(input, i);
        Console.WriteLine("U+{0:X4}", codepoint);
    }
Output:
U+1D161
Since strings in .NET are UTF-16 encoded, the char values that make up the string need to be converted to UTF-32 first.
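To illustrate what that conversion does for a surrogate pair, here is a minimal sketch of the arithmetic, compared against the built-in char.ConvertToUtf32. The class name SurrogateDemo and the helper Combine are hypothetical names made up for this example; in real code the built-in method is the simpler choice.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: combines a high and a low surrogate into one code point.
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Expected a high surrogate followed by a low surrogate.");

        // Each surrogate carries 10 payload bits; the pair encodes an offset from U+10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    static void Main()
    {
        int manual = Combine('\uD834', '\uDD61');
        int builtIn = char.ConvertToUtf32("\uD834\uDD61", 0);
        Console.WriteLine("U+{0:X4} U+{1:X4}", manual, builtIn); // prints: U+1D161 U+1D161
    }
}
```

Both values agree, which is why the loop above can rely on char.ConvertToUtf32 alone.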
Easy, since a char in C# is actually a UTF-16 code unit, which for any character in the Basic Multilingual Plane has the same value as its code point:
    char x = 'A';
    Console.WriteLine("U+{0:X4}", (int)x);
To address the comments: a char in C# is a 16-bit number and holds a single UTF-16 code unit. Code points above the 16-bit space cannot be represented in one C# char, and chars in C# are not variable width. A string, however, can have two chars following each other, each being a code unit, that together form a surrogate pair encoding one code point. If you have a string input containing characters above the 16-bit space, you can use char.IsSurrogatePair and char.ConvertToUtf32, as suggested in another answer:
    string input = ....
    for (int i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
    {
        int x = char.ConvertToUtf32(input, i);
        Console.WriteLine("U+{0:X4}", x);
    }
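If you want the "U+XXXX" strings back rather than printed, the same loop can be wrapped in a method. This is only a sketch: CodePointFormatter and Format are hypothetical names, and reporting an unpaired surrogate as its raw code unit value is an assumption made here, since char.ConvertToUtf32(string, int) throws an ArgumentException on such input.

```csharp
using System;
using System.Collections.Generic;

static class CodePointFormatter
{
    // Hypothetical helper: returns one "U+XXXX" string per code point in the input.
    public static List<string> Format(string input)
    {
        var result = new List<string>();
        for (int i = 0; i < input.Length; i++)
        {
            int codePoint;
            if (char.IsSurrogatePair(input, i))
            {
                codePoint = char.ConvertToUtf32(input, i);
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character, or an unpaired surrogate reported as-is (assumption).
                codePoint = input[i];
            }
            result.Add("U+" + codePoint.ToString("X4"));
        }
        return result;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", Format("A\uD834\uDD61")));
        // prints: U+0041 U+1D161
    }
}
```

Whether a lone surrogate should be reported, skipped, or treated as an error depends on where the string comes from, so adjust that branch to taste.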