Convert Unicode surrogate pair to literal string

I am trying to read a high Unicode character from one string into another. For brevity, I will simplify my code as shown below:

public static void UnicodeTest()
{
    var highUnicodeChar = "𝐀"; //Not the standard A

    var result1 = highUnicodeChar; //this works
    var result2 = highUnicodeChar[0].ToString(); // returns \ud835
}

When I assign highUnicodeChar to result1 directly, it retains its literal value of 𝐀. When I try to access it by index, it returns \ud835. As I understand it, this is a surrogate pair of UTF-16 characters used to represent a UTF-32 character. I am pretty sure this problem has to do with trying to implicitly convert a char to a string.

In the end, I want result2 to yield the same value as result1. How can I do this?

hargle asked Oct 01 '18 03:10


1 Answer

In Unicode, you have code points. These are numbers in the range U+0000 to U+10FFFF, so each one fits in 21 bits. Your character 𝐀, Mathematical Bold Capital A, has a code point of U+1D400.

In Unicode encodings, you have code units. These are the natural unit of the encoding: 8-bit for UTF-8, 16-bit for UTF-16, and so on. One or more code units encode a single code point.
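
As a rough illustration of code units versus code points, the sketch below counts how many code units U+1D400 occupies in each encoding. It assumes C# 9 or later (top-level statements); the variable names are made up for this example.

```csharp
using System;
using System.Text;

// U+1D400 written as an escape so the source file's encoding doesn't matter.
string s = "\U0001D400";

int utf8Units  = Encoding.UTF8.GetByteCount(s);       // 8-bit code units
int utf16Units = s.Length;                            // 16-bit code units (.NET chars)
int utf32Units = Encoding.UTF32.GetByteCount(s) / 4;  // 32-bit code units

// One code point, but a different number of code units per encoding.
Console.WriteLine($"UTF-8: {utf8Units}, UTF-16: {utf16Units}, UTF-32: {utf32Units}");
// prints "UTF-8: 4, UTF-16: 2, UTF-32: 1"
```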

In UTF-16, two code units that form a single code point are called a surrogate pair. Surrogate pairs are used to encode any code point that doesn't fit in 16 bits, i.e. U+10000 and up.

This gets a little tricky in .NET, as a .NET Char represents a single UTF-16 code unit, and a .NET String is a sequence of UTF-16 code units.

So your code point 𝐀 (U+1D400) can't fit in 16 bits and needs a surrogate pair, meaning your string has two code units in it:

var highUnicodeChar = "𝐀";
char a = highUnicodeChar[0]; // code unit 0xD835
char b = highUnicodeChar[1]; // code unit 0xDC00

So when you index into the string like that, you're only getting half of the surrogate pair.
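
If it helps to see where 0xD835 and 0xDC00 come from, here is a small sketch of the UTF-16 surrogate arithmetic for U+1D400, checked against the built-in char.ConvertFromUtf32 and char.ConvertToUtf32. It assumes C# 9 or later (top-level statements), and the variable names are only illustrative.

```csharp
using System;

int codePoint = 0x1D400;

// Code points from U+10000 up are first reduced by 0x10000, leaving a
// 20-bit value that is split into two 10-bit halves.
int offset = codePoint - 0x10000;
char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
char low  = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits

Console.WriteLine($"{(int)high:X4} {(int)low:X4}"); // prints "D835 DC00"

// The framework helpers do the same conversion in both directions.
string viaFramework = char.ConvertFromUtf32(codePoint);
int roundTripped = char.ConvertToUtf32(high, low);
Console.WriteLine(roundTripped == codePoint);       // prints "True"
```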

You can use char.IsSurrogatePair to test whether a surrogate pair starts at a given index. For instance:

string GetFullCodePointAtIndex(string s, int idx) =>
    s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
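
Applied to the question, a usage sketch might look like the following. It assumes C# 9 or later (top-level statements). SplitIntoCodePoints is a hypothetical helper added only for illustration, and the character is written as the escape \U0001D400 so the example doesn't depend on file encoding.

```csharp
using System;
using System.Collections.Generic;

// Same helper as above, restated so the sketch is self-contained.
string GetFullCodePointAtIndex(string s, int idx) =>
    s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);

// Hypothetical helper: splits a string into one substring per code point.
List<string> SplitIntoCodePoints(string s)
{
    var parts = new List<string>();
    for (int i = 0; i < s.Length; )
    {
        string part = GetFullCodePointAtIndex(s, i);
        parts.Add(part);
        i += part.Length; // advance by 1 or 2 code units
    }
    return parts;
}

string highUnicodeChar = "\U0001D400";
string result2 = GetFullCodePointAtIndex(highUnicodeChar, 0);
List<string> pieces = SplitIntoCodePoints("A\U0001D400B");

Console.WriteLine(result2 == highUnicodeChar); // prints "True"
Console.WriteLine(pieces.Count);               // prints "3"
```

Note that the index is still a code unit index, so a caller walking the string has to advance by the length of each returned piece rather than by one.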

It's important to note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the "visible thing" that most people, when asked, would ultimately call a "character". A grapheme cluster is made from one or more code points: a base character and zero or more combining characters. An example of a combining character is an umlaut, or any of the various other decorations/modifiers you might want to add. See this answer for a horrifying example of what combining characters can do.

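To test for a combining character, you can use char.GetUnicodeCategory to check for an enclosing mark (UnicodeCategory.EnclosingMark), non-spacing mark (UnicodeCategory.NonSpacingMark), or spacing combining mark (UnicodeCategory.SpacingCombiningMark).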
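
A minimal sketch of that check, assuming C# 9 or later (top-level statements). IsCombiningMark is a hypothetical helper name, and it only looks at a single char, so it covers marks in the Basic Multilingual Plane and not marks that themselves need a surrogate pair.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: true when the char falls in one of the three mark categories.
bool IsCombiningMark(char c)
{
    UnicodeCategory category = char.GetUnicodeCategory(c);
    return category == UnicodeCategory.NonSpacingMark
        || category == UnicodeCategory.SpacingCombiningMark
        || category == UnicodeCategory.EnclosingMark;
}

// "u" followed by U+0308 COMBINING DIAERESIS renders as a single "ü",
// but it is two code points (and two chars) in the string.
string decomposed = "u\u0308";

Console.WriteLine(decomposed.Length);               // prints "2"
Console.WriteLine(IsCombiningMark(decomposed[0]));  // prints "False"
Console.WriteLine(IsCombiningMark(decomposed[1]));  // prints "True"
```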

Cory Nelson answered Oct 17 '22 06:10