How would you get an array of Unicode code points from a .NET String?

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is a UTF-16 code unit, so some characters become wacky (surrogate) pairs instead. Thus, when enumerating all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons against high values fail.

I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...

How would you convert a string to an array (int[]) of 32-bit Unicode code points?

775

asked Mar 26 '09 20:03

Neil C. Obremski

1 Answer

You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:

The character is from the Basic Multilingual Plane, and is encoded by a single code unit.
The character is outside the BMP, and is encoded using a high-low surrogate pair of code units.

Therefore, assuming the string is valid, this returns an array of code points for a given string:

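    using System;
    using System.Collections.Generic;

    public static int[] ToCodePoints(string str)
    {
        if (str == null)
            throw new ArgumentNullException("str");

        // At most one code point per char, so str.Length is a safe capacity.
        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            // Reads one code point starting at index i (one or two chars).
            codePoints.Add(Char.ConvertToUtf32(str, i));
            // A high surrogate means the low surrogate at i + 1 was consumed too.
            if (Char.IsHighSurrogate(str[i]))
                i += 1;
        }

        return codePoints.ToArray();
    }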
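The method above relies on the string being valid: Char.ConvertToUtf32(string, int) throws an ArgumentException when it hits an unpaired surrogate. If the input cannot be trusted, one possible variant is sketched below. The name ToCodePointsLenient and the choice to substitute U+FFFD (the Unicode replacement character) are assumptions for illustration, not part of the BCL or of the answer above.

```csharp
using System;
using System.Collections.Generic;

public static class LenientCodePoints
{
    // Hypothetical helper: like ToCodePoints, but an unpaired surrogate
    // becomes U+FFFD instead of causing an exception.
    public static int[] ToCodePointsLenient(string str)
    {
        if (str == null)
            throw new ArgumentNullException("str");

        const int ReplacementChar = 0xFFFD;
        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            if (Char.IsSurrogatePair(str, i))
            {
                // Valid high + low pair: combine and skip the low surrogate.
                codePoints.Add(Char.ConvertToUtf32(str[i], str[i + 1]));
                i++;
            }
            else if (Char.IsSurrogate(str[i]))
            {
                // Lone high or low surrogate: substitute U+FFFD.
                codePoints.Add(ReplacementChar);
            }
            else
            {
                // BMP character: the code unit is the code point.
                codePoints.Add(str[i]);
            }
        }

        return codePoints.ToArray();
    }
}
```

Whether to substitute, skip, or reject invalid input depends on what the range check is meant to guarantee; substitution is only one option.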

An example with a surrogate pair 🌀 and a character composed of two code points, ñ (n followed by a combining tilde):

    ToCodePoints("\U0001F300 El Ni\u006E\u0303o");                        // 🌀 El Niño
    // { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f }
    // 🌀   E l   N i n ̃◌ o
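To see why 🌀 occupies two chars but yields a single code point, here is the UTF-16 arithmetic that Char.ConvertToUtf32 performs for a pair, written out by hand. This is a sketch for illustration only; the class and method names (SurrogateMath, Combine) are made up here, and in real code the BCL method is the better choice.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: combines a high and a low surrogate into one
    // code point using the UTF-16 formula.
    public static int Combine(char high, char low)
    {
        if (!Char.IsHighSurrogate(high) || !Char.IsLowSurrogate(low))
            throw new ArgumentException("Not a valid surrogate pair.");

        // The high surrogate carries the top 10 bits, the low surrogate the
        // bottom 10 bits, of (code point - 0x10000).
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

U+1F300 is stored as the chars 0xD83C and 0xDF00, so Combine('\uD83C', '\uDF00') gives 0x10000 + (0x3C << 10) + 0x300 = 0x1F300, the same value that Char.ConvertToUtf32('\uD83C', '\uDF00') returns.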

Here's another example. These two code points represent a thirty-second note with an accent-staccato mark, and both are encoded as surrogate pairs:

    ToCodePoints("\U0001D162\U0001D181");              // 𝅘𝅥𝅰𝆁
    // { 0x1d162, 0x1d181 }
    // 𝅘𝅥𝅰 𝆁◌

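When normalized with String.Normalize(), which defaults to Normalization Form C, the note is decomposed (these musical symbols are excluded from recomposition), giving a notehead, combining stem, combining flag and combining accent-staccato, all of them surrogate pairs: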

    ToCodePoints("\U0001D162\U0001D181".Normalize());  // 𝅘𝅥𝅰𝆁
    // { 0x1d158, 0x1d165, 0x1d170, 0x1d181 }
    // 𝅘 𝅥 𝅰 𝆁◌

Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a sequence of code points that together form a single grapheme. In the first example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ̃◌. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
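The difference between code points and text elements can be checked with the BCL's StringInfo class. The sketch below is only an illustration; the class and method names (TextElementDemo, CountCodePoints, CountTextElements) are invented for this example.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Counts code points: a valid surrogate pair counts once.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (Char.IsSurrogatePair(s, i))
                i++; // skip the low surrogate of the pair
            count++;
        }
        return count;
    }

    // Counts text elements: a base character plus its combining marks
    // counts once.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For "El Nin\u0303o", CountCodePoints returns 8 while CountTextElements returns 7, because n and the combining tilde are two code points but one text element. A range check on code points needs the first view, not the second.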

121

answered Sep 27 '22 20:09

Daniel A.A. Pelsmaeker
