
C# and UTF-16 characters

Tags: c#, unicode

Is it possible in C# to use UTF-32 characters not in Plane 0 as a char?

string s = ""; // valid
char c = ''; // generates a compiler error ("Too many characters in character literal")

And in s it is represented by two characters, not one.

Edit: I mean, is there a character and string type with full Unicode support, UTF-32 or UTF-8 per character? For example, if I want a for loop over the UTF-32 characters in a string (which may not be in Plane 0).

Dutow asked Mar 30 '09 12:03


1 Answer

The string class represents a UTF-16 encoded block of text, and each char in a string represents a single UTF-16 code unit, so a character outside Plane 0 occupies two chars (a surrogate pair).

Although there is no BCL type that represents a single Unicode code point, there is support for Unicode characters beyond Plane 0 in the form of method overloads taking a string and an index instead of just a char. For example, the static GetUnicodeCategory(char) method on the System.Globalization.CharUnicodeInfo class has a corresponding GetUnicodeCategory(string,int) method that will recognize a simple character or a surrogate pair starting at the specified index.
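As a rough sketch of how those string-and-index overloads can be combined into the "for loop over UTF-32 characters" asked about in the question: the snippet below uses char.ConvertToUtf32(string, int) and char.IsSurrogatePair(string, int) from the BCL. The class name CodePointDemo and the helper CodePoints are hypothetical names made up for this illustration, and the sample string simply reuses the surrogate pair from this answer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CodePointDemo
{
    // Hypothetical helper (not a BCL method): yields each Unicode code point
    // of a UTF-16 string as an int, i.e. its UTF-32 value.
    public static IEnumerable<int> CodePoints(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // Combines a surrogate pair starting at index i into one code point;
            // throws ArgumentException if an unpaired surrogate is found there.
            yield return char.ConvertToUtf32(s, i);

            // A surrogate pair occupies two chars, so skip its second half.
            if (char.IsSurrogatePair(s, i))
            {
                i++;
            }
        }
    }

    public static void Main()
    {
        string s = "a\uD950\uDF21b";

        foreach (int codePoint in CodePoints(s))
        {
            Console.WriteLine("U+{0:X4}", codePoint);
        }

        // The (string, int) overload inspects the whole surrogate pair at index 1.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s, 1));
    }
}
```

The loop visits three code points even though s.Length is 4, which is the mismatch the question ran into. If the input may contain unpaired surrogates, the ArgumentException would need handling; this sketch leaves that out.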


To iterate through the text elements in a string, you can use the methods on the System.Globalization.StringInfo class. Here, a "text element" corresponds to a single character as displayed on screen. This means that simple characters ("a"), combining character sequences ("a\u0304\u0308" = "ā̈"), and surrogate pairs ("\uD950\uDF21", which encodes the single code point U+64321) will each be treated as a single text element.

Specifically, the GetTextElementEnumerator static method will allow you to enumerate over each text element in a string (see the linked MSDN page for a code example).
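In case the linked example is not at hand, here is a minimal sketch of that enumeration. It relies only on StringInfo.GetTextElementEnumerator and TextElementEnumerator.GetTextElement from the BCL; the class name TextElementDemo and the helper TextElements are hypothetical names for this illustration, and the input is built from the three sample strings mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper (not a BCL method): collects the text elements
    // of a string, one string per element.
    public static List<string> TextElements(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);

        while (enumerator.MoveNext())
        {
            // GetTextElement returns the current element as a string,
            // which may span several UTF-16 code units.
            elements.Add(enumerator.GetTextElement());
        }

        return elements;
    }

    public static void Main()
    {
        // Simple character + combining sequence + surrogate pair.
        string s = "a" + "a\u0304\u0308" + "\uD950\uDF21";

        foreach (string element in TextElements(s))
        {
            Console.WriteLine("{0} UTF-16 code unit(s)", element.Length);
        }
    }
}
```

Each element comes back as a string rather than a char, because one text element can be longer than a single UTF-16 code unit.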

Emperor XLII answered Sep 18 '22 13:09