I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16, so some characters become wacky (surrogate) pairs instead. Thus, when enumerating all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons with high values fail.
I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...
How would you convert a string to an array (int[]) of 32-bit Unicode code points?
.NET uses UTF-16 to encode the text in a string. A char instance represents a 16-bit code unit.
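You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:
1. The character is in the Basic Multilingual Plane (U+0000 to U+FFFF) and is encoded as a single code unit.
2. The character is outside the BMP and is encoded as a surrogate pair: a high surrogate code unit followed by a low surrogate code unit.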
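To make the surrogate-pair case concrete, here is a minimal sketch of the arithmetic that turns a high/low pair into a code point. The names SurrogateDemo and Combine are made up for illustration; in real code Char.ConvertToUtf32 does the same job and also validates its input.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: combines a surrogate pair into one code point.
    // Assumes hi is in U+D800..U+DBFF and lo is in U+DC00..U+DFFF;
    // no validation is done here.
    public static int Combine(char hi, char lo)
    {
        // Each surrogate carries 10 bits of the value above U+FFFF.
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
}
```

For "\U0001F300" the two code units are 0xD83C and 0xDF00, so Combine returns 0x10000 + (0x3C << 10) + 0x300 = 0x1F300, the same value Char.ConvertToUtf32 gives for that string at index 0.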
Therefore, assuming the string is valid, this returns an array of code points for a given string:
    public static int[] ToCodePoints(string str)
    {
        if (str == null)
            throw new ArgumentNullException("str");

        // str.Length is an upper bound: each code point takes one or two chars.
        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            codePoints.Add(Char.ConvertToUtf32(str, i));
            // A high surrogate means the code point used two chars; skip the low one.
            if (Char.IsHighSurrogate(str[i]))
                i += 1;
        }

        return codePoints.ToArray();
    }
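Since the question is about checking a string against range restrictions, the same loop can do the comparison directly without building an array. This is only a sketch: the class name, the method names and the two ranges are placeholders invented for the example, not taken from the question.

```csharp
using System;

static class CodePointRanges
{
    // Hypothetical restriction list: inclusive { first, last } code point ranges.
    static readonly int[][] Allowed =
    {
        new[] { 0x0020, 0x007E },   // printable ASCII
        new[] { 0x1F300, 0x1F5FF }, // Miscellaneous Symbols and Pictographs
    };

    // Returns true if every code point of str falls inside an allowed range.
    // Like the code above, this assumes str has no unpaired surrogates;
    // Char.ConvertToUtf32 throws ArgumentException if it meets one.
    public static bool IsAllowed(string str)
    {
        if (str == null) throw new ArgumentNullException("str");
        for (int i = 0; i < str.Length; i++)
        {
            int cp = Char.ConvertToUtf32(str, i);
            if (Char.IsHighSurrogate(str[i])) i++; // skip the low surrogate
            if (!InAnyRange(cp)) return false;
        }
        return true;
    }

    static bool InAnyRange(int cp)
    {
        foreach (int[] range in Allowed)
        {
            if (cp >= range[0] && cp <= range[1]) return true;
        }
        return false;
    }
}
```

With these placeholder ranges, IsAllowed("\U0001F300 El") is true, while IsAllowed("Ni\u00F1o") is false because U+00F1 is in neither range.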
An example with a surrogate pair 🌀 and a composed character ñ:
    ToCodePoints("\U0001F300 El Ni\u006E\u0303o"); // 🌀 El Niño
    // { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f }
    //   🌀             E     l           N     i     n     ̃◌     o
Here's another example. These two code points represent a 32nd note with a staccato accent, both encoded as surrogate pairs:
    ToCodePoints("\U0001D162\U0001D181"); // 𝅘𝅥𝅰𝆁
    // { 0x1d162, 0x1d181 }
    //   𝅘𝅥𝅰        𝆁◌
When normalized to Form C (the default for String.Normalize()), they are decomposed into a notehead, combining stem, combining flag and combining accent-staccato, all encoded as surrogate pairs:
    ToCodePoints("\U0001D162\U0001D181".Normalize()); // 𝅘𝅥𝅰𝆁
    // { 0x1d158, 0x1d165, 0x1d170, 0x1d181 }
    //   𝅘        𝅥        𝅰        𝆁◌
Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ̃◌. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
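The difference between the two units can be seen by counting both for the same string. The sketch below is illustrative only: TextElementDemo and its method names are invented for this example, while StringInfo.LengthInTextElements is the BCL property that counts text elements.

```csharp
using System;
using System.Globalization;

static class TextElementDemo
{
    // Number of text elements, as reported by StringInfo.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    // Number of code points, assuming s has no unpaired surrogates.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (Char.IsHighSurrogate(s[i])) i++; // a surrogate pair counts once
            count++;
        }
        return count;
    }
}
```

For "n\u0303" (n followed by a combining tilde) this gives two code points but one text element. For "\U0001F300" it gives one of each, even though the string's Length is 2.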