I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16 and therefore some characters become wacky (surrogate) pairs instead. Thus when enumerating all the char's in a string, I don't get the 32-bit Unicode code points and some comparisons with high values fail.
I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...
How would you convert a string to an array (int[]) of 32-bit Unicode code points?
The equivalent in C# is the String class. According to MSDN: (A String) Represents text as a series of Unicode characters. So, if you do string str = "a string here"; , you have a Unicode string.
Unicode is a standard for character encoding and decoding for computers. You can use various encodings from Unicode, UTF-8 (8 bit) UTF-16 (16 bit), and so on.
In this article, I will explain C# String Encoding/Decoding and Conversions in C#. stringEncodeDecode.zip. All strings in a . NET Framework program are stored as 16-bit Unicode characters. At times you might need to convert from Unicode to some other character encoding, or from some other character encoding to Unicode.
NET uses UTF-16 to encode the text in a string . A char instance represents a 16-bit code unit.
You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:
Therefore, assuming the string is valid, this returns an array of code points for a given string:
public static int[] ToCodePoints(string str) { if (str == null) throw new ArgumentNullException("str"); var codePoints = new List<int>(str.Length); for (int i = 0; i < str.Length; i++) { codePoints.Add(Char.ConvertToUtf32(str, i)); if (Char.IsHighSurrogate(str[i])) i += 1; } return codePoints.ToArray(); } An example with a surrogate pair 🌀 and a composed character ñ:
ToCodePoints("\U0001F300 El Ni\u006E\u0303o"); // 🌀 El Niño // { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } // 🌀 E l N i n ̃◌ o Here's another example. These two code points represents a 32th musical note with a staccato accent, both surrogate pairs:
ToCodePoints("\U0001D162\U0001D181"); // 𝅘𝅥𝅰𝆁 // { 0x1d162, 0x1d181 } // 𝅘𝅥𝅰 𝆁◌ When C-normalized, they are decomposed into a notehead, combining stem, combining flag and combining accent-staccato, all surrogate pairs:
ToCodePoints("\U0001D162\U0001D181".Normalize()); // 𝅘𝅥𝅰𝆁 // { 0x1d158, 0x1d165, 0x1d170, 0x1d181 } // 𝅘 𝅥 𝅰 𝆁◌ Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ̃◌. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With