I am trying to read a high Unicode character from one string into another. For brevity, I will simplify my code as shown below:
public static void UnicodeTest()
{
var highUnicodeChar = "𝐀"; //Not the standard A
var result1 = highUnicodeChar; //this works
var result2 = highUnicodeChar[0].ToString(); // returns \ud835
}
When I assign highUnicodeChar to result1 directly, it retains its literal value of 𝐀. When I try to access it by index, it returns \ud835. As I understand it, this is a surrogate pair of UTF-16 characters used to represent a UTF-32 character. I am pretty sure this problem has to do with trying to implicitly convert a char to a string.
In the end, I want result2 to yield the same value as result1. How can I do this?
With surrogate pairs, a Unicode code point from range U+D800 to U+DBFF (called "high surrogate") gets combined with another Unicode code point from range U+DC00 to U+DFFF (called "low surrogate") to generate a whole new character, allowing the encoding of over one million additional characters.
The surrogate code points are used in UTF-16 to represent code points beyond U+FFFF. They are used in pairs, so such a character takes 4 bytes. This mechanism is not needed in UTF-8, so text encoded with UTF-8 shouldn't contain them.
These characters have special values; each is made up of two 16-bit code units drawn from two specific ranges, such that the first code unit is in one range (0xD800-0xDBFF) and the second code unit is in the other (0xDC00-0xDFFF). This is called a surrogate pair.
The main use of UTF-32 is in internal APIs where the data is single code points or glyphs, rather than strings of characters.
In Unicode, you have code points. These are up to 21 bits long (U+0000 through U+10FFFF). Your character 𝐀, Mathematical Bold Capital A, has a code point of U+1D400.
In Unicode encodings, you have code units. These are the natural unit of the encoding: 8-bit for UTF-8, 16-bit for UTF-16, and so on. One or more code units encode a single code point.
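To make the code unit idea concrete, here is a minimal sketch that counts how many code units the same code point needs in each encoding. The class and method names (CodeUnitDemo, Utf8Units and so on) are made up for illustration; only the System.Text.Encoding calls are real API.

```csharp
using System;
using System.Text;

public static class CodeUnitDemo
{
    // Code units = encoded byte count / bytes per code unit.
    // GetByteCount does not include a byte order mark.
    public static int Utf8Units(string s) => Encoding.UTF8.GetByteCount(s);          // 1 byte per unit
    public static int Utf16Units(string s) => Encoding.Unicode.GetByteCount(s) / 2;  // 2 bytes per unit
    public static int Utf32Units(string s) => Encoding.UTF32.GetByteCount(s) / 4;    // 4 bytes per unit

    public static void Main()
    {
        string s = "\U0001D400"; // 𝐀, a single code point
        Console.WriteLine($"UTF-8: {Utf8Units(s)}, UTF-16: {Utf16Units(s)}, UTF-32: {Utf32Units(s)}");
        // prints "UTF-8: 4, UTF-16: 2, UTF-32: 1"
    }
}
```

All three encodings happen to use 4 bytes for this character; what differs is how many code units those bytes are split into.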
In UTF-16, two code units that form a single code point are called a surrogate pair. Surrogate pairs are used to encode any code point that does not fit in 16 bits, i.e. U+10000 and up.
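For the curious, the arithmetic behind a surrogate pair can be sketched as follows. ToSurrogates is a hypothetical helper written only to show the bit manipulation; in real code the built-in char.ConvertFromUtf32 does the same job.

```csharp
using System;

public static class SurrogateMath
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its two UTF-16 code units.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // leaves a 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits go into the high surrogate
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits go into the low surrogate
        return (high, low);
    }

    public static void Main()
    {
        var (high, low) = ToSurrogates(0x1D400);
        Console.WriteLine($"{(int)high:X4} {(int)low:X4}"); // prints "D835 DC00"

        // The framework method produces the same two code units.
        string s = char.ConvertFromUtf32(0x1D400);
        Console.WriteLine(s[0] == high && s[1] == low);     // prints "True"
    }
}
```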
This gets a little tricky in .NET, as a .NET Char represents a single UTF-16 code unit, and a .NET String is a collection of code units.
So your code point 𝐀 (U+1D400) can't fit in 16 bits and needs a surrogate pair, meaning your string has two code units in it:
var highUnicodeChar = "𝐀";
char a = highUnicodeChar[0]; // code unit 0xD835
char b = highUnicodeChar[1]; // code unit 0xDC00
So when you index into the string like that, you're actually only getting half of the surrogate pair.
You can use char.IsSurrogatePair to test for a surrogate pair. For instance:
string GetFullCodePointAtIndex(string s, int idx) =>
s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
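As a usage sketch, the helper above can be applied to the string from the question and also used to walk a string one code point at a time. The wrapper class CodePointWalker and the method SplitIntoCodePoints are names invented for this example; char.IsSurrogatePair and char.ConvertToUtf32 are the real framework calls.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointWalker
{
    // Helper from the answer: returns one or two chars, depending on whether
    // a surrogate pair starts at idx.
    public static string GetFullCodePointAtIndex(string s, int idx) =>
        s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);

    // Splits a string into one substring per code point.
    public static List<string> SplitIntoCodePoints(string s)
    {
        var parts = new List<string>();
        for (int i = 0; i < s.Length; )
        {
            string part = GetFullCodePointAtIndex(s, i);
            parts.Add(part);
            i += part.Length; // advance by 1 or 2 code units
        }
        return parts;
    }

    public static void Main()
    {
        string highUnicodeChar = "\U0001D400"; // 𝐀
        string result2 = GetFullCodePointAtIndex(highUnicodeChar, 0);
        Console.WriteLine(result2 == highUnicodeChar);                       // prints "True"
        Console.WriteLine(char.ConvertToUtf32(highUnicodeChar, 0).ToString("X")); // prints "1D400"
        Console.WriteLine(SplitIntoCodePoints("x\U0001D400y").Count);        // prints "3"
    }
}
```

Note that idx is still a code unit index, so passing the index of a low surrogate returns just that lone half.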
It's important to note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the "visible thing" that most people, when asked, would ultimately call a "character". A grapheme cluster is made from one or more code points: a base character, and zero or more combining characters. An example of a combining character is an umlaut or any of the various other decorations/modifiers you might want to add. See this answer for a horrifying example of what combining characters can do.
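A small sketch of the difference between code units and grapheme clusters, using System.Globalization.StringInfo, which .NET calls "text elements". GraphemeDemo and its method names are illustrative only, and the exact segmentation rules StringInfo applies have varied between .NET versions, so treat this as a demonstration for a simple base-plus-combining-mark case rather than a general guarantee.

```csharp
using System;
using System.Globalization;

public static class GraphemeDemo
{
    // Number of text elements (roughly, user-perceived characters) in s.
    public static int CountTextElements(string s) => new StringInfo(s).LengthInTextElements;

    // The whole text element starting at index, including any combining marks.
    public static string TextElementAt(string s, int index) => StringInfo.GetNextTextElement(s, index);

    public static void Main()
    {
        string s = "u\u0308"; // 'u' followed by U+0308 COMBINING DIAERESIS
        Console.WriteLine(s.Length);                   // prints "2" (code units)
        Console.WriteLine(CountTextElements(s));       // prints "1" (text elements)
        Console.WriteLine(TextElementAt(s, 0).Length); // prints "2" (base char plus mark)
    }
}
```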
To test for a combining character, you can use char.GetUnicodeCategory and check whether the result is UnicodeCategory.EnclosingMark, UnicodeCategory.NonSpacingMark, or UnicodeCategory.SpacingCombiningMark.
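That check might look something like the sketch below. IsCombiningMark is a hypothetical helper name; the char.GetUnicodeCategory(string, int) overload and the UnicodeCategory enum members are the real API.

```csharp
using System;
using System.Globalization;

public static class CombiningMarkCheck
{
    // True if the character at index falls into one of the three "mark" categories.
    public static bool IsCombiningMark(string s, int index)
    {
        UnicodeCategory category = char.GetUnicodeCategory(s, index);
        return category == UnicodeCategory.NonSpacingMark
            || category == UnicodeCategory.SpacingCombiningMark
            || category == UnicodeCategory.EnclosingMark;
    }

    public static void Main()
    {
        string s = "u\u0308"; // 'u' followed by U+0308 COMBINING DIAERESIS
        Console.WriteLine(IsCombiningMark(s, 0)); // prints "False" (lowercase letter)
        Console.WriteLine(IsCombiningMark(s, 1)); // prints "True" (non-spacing mark)
    }
}
```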