
How would you get an array of Unicode code points from a .NET String?

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is a UTF-16 code unit, so some characters become wacky (surrogate) pairs instead. As a result, when I enumerate all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons with high values fail.

I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...

How would you convert a string to an array (int[]) of 32-bit Unicode code points?

Neil C. Obremski asked Mar 26 '09 20:03



1 Answer

You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:

  1. The character is from the Basic Multilingual Plane, and is encoded by a single code unit.
  2. The character is outside the BMP, and is encoded using a surrogate pair: a high surrogate code unit followed by a low surrogate code unit.
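To make the second case concrete, here is a minimal sketch of the arithmetic that turns a surrogate pair into a code point. The class and method names (SurrogateMath, Combine) are made up for illustration; in real code the BCL's Char.ConvertToUtf32 does the same job.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: combines a high and a low surrogate into one
    // code point, following the UTF-16 definition.
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Expected a high surrogate followed by a low surrogate.");

        // Each surrogate carries 10 bits; the result is offset by 0x10000
        // because the BMP is encoded directly.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

For example, 🌀 (U+1F300) is stored as the code units D83C DF00, and SurrogateMath.Combine('\uD83C', '\uDF00') gives 0x1F300, the same value as Char.ConvertToUtf32('\uD83C', '\uDF00').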

Therefore, assuming the string is valid, this returns an array of code points for a given string:

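// Requires: using System; using System.Collections.Generic;
public static int[] ToCodePoints(string str)
{
    if (str == null)
        throw new ArgumentNullException("str");

    // At most one code point per UTF-16 code unit.
    var codePoints = new List<int>(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        // Decodes a single char, or a surrogate pair starting at index i.
        codePoints.Add(Char.ConvertToUtf32(str, i));
        // Skip the low surrogate that was consumed as part of the pair.
        if (Char.IsHighSurrogate(str[i]))
            i += 1;
    }

    return codePoints.ToArray();
}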
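The method above relies on the string being valid UTF-16; Char.ConvertToUtf32 throws an ArgumentException when it hits an unpaired surrogate. If you can't guarantee that, one option is a lenient variant along these lines. This is only a sketch: the name ToCodePointsLenient is made up, and passing a lone surrogate through as its own value is just one possible policy (you might prefer to substitute U+FFFD or reject the input).

```csharp
using System;
using System.Collections.Generic;

public static class LenientCodePoints
{
    // Hypothetical helper: like ToCodePoints, but an unpaired surrogate is
    // returned as its own 16-bit value instead of causing an exception.
    public static int[] ToCodePointsLenient(string str)
    {
        if (str == null)
            throw new ArgumentNullException("str");

        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            if (char.IsSurrogatePair(str, i))
            {
                codePoints.Add(char.ConvertToUtf32(str, i));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character, or a lone surrogate passed through as-is.
                codePoints.Add(str[i]);
            }
        }

        return codePoints.ToArray();
    }
}
```

With this, "\U0001F300b" still yields { 0x1f300, 0x62 }, while "a\uD83C" yields { 0x61, 0xd83c } rather than throwing.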

An example with a surrogate pair (🌀) and an ñ composed of two code points, an n followed by a combining tilde:

ToCodePoints("\U0001F300 El Ni\u006E\u0303o");                        // 🌀 El Niño
// { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f }
// 🌀   E l   N i n ̃◌ o

Here's another example. These two code points represent a thirty-second note with an accent-staccato mark; both are encoded as surrogate pairs:

ToCodePoints("\U0001D162\U0001D181");              // 𝅘𝅥𝅰𝆁
// { 0x1d162, 0x1d181 }                            // 𝅘𝅥𝅰 𝆁◌

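When normalized with Normalize() (which uses Form C by default), the note is decomposed, because these musical symbols are excluded from recomposition. The result is a notehead, a combining stem, a combining flag and the combining accent-staccato, all surrogate pairs: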

ToCodePoints("\U0001D162\U0001D181".Normalize());  // 𝅘𝅥𝅰𝆁
// { 0x1d158, 0x1d165, 0x1d170, 0x1d181 }          // 𝅘 𝅥 𝅰 𝆁◌
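The musical note is an unusual case where Form C ends up longer than the input. For most text, Form C composes and Form D decomposes, as the ñ shows. A small sketch to illustrate (NormalizationDemo and LengthAfter are made-up names; String.Normalize and NormalizationForm are the real BCL members):

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: number of UTF-16 code units in the string
    // after normalizing it to the given form.
    public static int LengthAfter(string s, NormalizationForm form)
    {
        return s.Normalize(form).Length;
    }
}
```

Here LengthAfter("n\u0303", NormalizationForm.FormC) is 1, because n plus combining tilde composes to U+00F1, and LengthAfter("\u00F1", NormalizationForm.FormD) is 2, because the precomposed ñ is split back into two code points. So the code point array you get depends on whether, and how, you normalize first.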

Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a sequence of one or more code points that together form a single grapheme. For example, in the first example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ̃◌. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
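If you want to see the difference between the two notions, a rough sketch like the following compares the counts. TextElementDemo and its methods are made-up names; StringInfo.LengthInTextElements and Char.IsSurrogatePair are BCL members. The exact text element boundaries for complex sequences can differ between runtime versions, but a base letter followed by a combining mark should count as one element either way.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Counts text elements (what a user would perceive as characters).
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    // Counts code points by treating each surrogate pair as one unit.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++; // the pair is a single code point
            count++;
        }
        return count;
    }
}
```

For "Nin\u0303o", CountCodePoints gives 5 (N, i, n, combining tilde, o) while CountTextElements gives 4, since n and the tilde form one text element.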

Daniel A.A. Pelsmaeker answered Sep 27 '22 20:09