Convert Unicode surrogate pair to literal string

Tags: c#, .net, unicode, unicode-escapes

I am trying to read a high Unicode character from one string into another. For brevity, I will simplify my code as shown below:

public static void UnicodeTest()
{
    var highUnicodeChar = "𝐀"; //Not the standard A

    var result1 = highUnicodeChar; //this works
    var result2 = highUnicodeChar[0].ToString(); // returns \ud835
}

When I assign highUnicodeChar to result1 directly, it retains its literal value of 𝐀. When I try to access it by index, it returns \ud835. As I understand it, this is a surrogate pair of UTF-16 characters used to represent a UTF-32 character. I am pretty sure this problem has to do with trying to implicitly convert a char to a string.

In the end, I want result2 to yield the same value as result1. How can I do this?

asked Oct 01 '18 03:10

hargle

1 Answer

In Unicode, you have code points. These range from U+0000 to U+10FFFF, so each one fits in 21 bits. Your character 𝐀, Mathematical Bold Capital A, has a code point of U+1D400.

In Unicode encodings, you have code units. These are the natural unit of the encoding: 8-bit for UTF-8, 16-bit for UTF-16, and so on. One or more code units encode a single code point.
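As a rough illustration of code units versus code points, the sketch below counts how many code units a string needs in each encoding. The class and method names (CodeUnitCounts, Utf8Units, and so on) are made up for this example; only the System.Text.Encoding calls are real .NET APIs.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of .NET: counts code units per encoding.
public static class CodeUnitCounts
{
    // UTF-8 code units are 8 bits wide, so the byte count is the unit count.
    public static int Utf8Units(string s) => Encoding.UTF8.GetByteCount(s);

    // A .NET string stores UTF-16, so Length is already the 16-bit unit count.
    public static int Utf16Units(string s) => s.Length;

    // UTF-32 code units are 32 bits wide: four bytes per unit.
    public static int Utf32Units(string s) => Encoding.UTF32.GetByteCount(s) / 4;
}
```

For "𝐀" this should give 4 UTF-8 units, 2 UTF-16 units, and 1 UTF-32 unit, all encoding the single code point U+1D400.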

In UTF-16, two code units that form a single code point are called a surrogate pair. Surrogate pairs are used to encode any code point that does not fit in 16 bits, i.e. U+10000 and up.
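To make the surrogate pair arithmetic concrete, here is a minimal sketch of the standard UTF-16 calculation. SurrogateMath and its methods are illustrative names, not framework APIs; in real code, char.ConvertFromUtf32 and char.ConvertToUtf32 do this for you.

```csharp
using System;

// Hypothetical helper showing the UTF-16 surrogate arithmetic by hand.
public static class SurrogateMath
{
    // Splits a code point in U+10000..U+10FFFF into high and low surrogates.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }

    // Recombines a high and low surrogate into the original code point.
    public static int FromSurrogates(char high, char low) =>
        0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

For U+1D400 the offset is 0xD400, whose top 10 bits are 0x35 and bottom 10 bits are 0x000, which gives the pair 0xD835, 0xDC00.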

This gets a little tricky in .NET, as a .NET Char represents a single UTF-16 code unit, and a .NET String is a sequence of those code units rather than of code points.

So your code point 𝐀 (U+1D400) can't fit in 16 bits and needs a surrogate pair, meaning your string has two code units in it:

var highUnicodeChar = "𝐀";
char a = highUnicodeChar[0]; // code unit 0xD835
char b = highUnicodeChar[1]; // code unit 0xDC00

That means when you index into the string like that, you're only getting half of the surrogate pair (here, the high surrogate).
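If you want to check which half you are holding, the built-in char methods can tell you. The sketch below wraps them in a hypothetical SurrogateInspection class; char.IsHighSurrogate, char.IsLowSurrogate, and char.ConvertToUtf32 are the real .NET calls.

```csharp
using System;

// Hypothetical helper around the built-in surrogate checks.
public static class SurrogateInspection
{
    // Describes the role of the UTF-16 code unit at s[idx].
    public static string Describe(string s, int idx)
    {
        char c = s[idx];
        if (char.IsHighSurrogate(c)) return "high surrogate";
        if (char.IsLowSurrogate(c)) return "low surrogate";
        return "not a surrogate";
    }

    // Returns the full code point that starts at idx.
    // Note: char.ConvertToUtf32 throws if idx points at a low surrogate
    // or at a high surrogate with no matching low surrogate after it.
    public static int CodePointAt(string s, int idx) => char.ConvertToUtf32(s, idx);
}
```

With the string from the question, index 0 is a high surrogate, index 1 is a low surrogate, and CodePointAt(s, 0) should return 0x1D400.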

You can use char.IsSurrogatePair to test whether a surrogate pair starts at a given index. For instance:

string GetFullCodePointAtIndex(string s, int idx) =>
    s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
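Building on that helper, a loop can walk a whole string one code point at a time. This is only a sketch: CodePointSplitter and SplitCodePoints are names invented here, and the helper is repeated so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper class; only the helper body comes from the answer.
public static class CodePointSplitter
{
    // Returns one or two code units, depending on whether a surrogate
    // pair starts at idx.
    public static string GetFullCodePointAtIndex(string s, int idx) =>
        s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);

    // Splits s into one string per code point, advancing by one or two
    // code units each step.
    public static List<string> SplitCodePoints(string s)
    {
        var parts = new List<string>();
        for (int i = 0; i < s.Length; )
        {
            string part = GetFullCodePointAtIndex(s, i);
            parts.Add(part);
            i += part.Length;
        }
        return parts;
    }
}
```

For the question's case, CodePointSplitter.GetFullCodePointAtIndex(highUnicodeChar, 0) should give result2 the same value as result1. An unpaired surrogate is returned as a single code unit, so malformed input is passed through rather than rejected.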

It is important to note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the "visible thing" that most people, when asked, would ultimately call a "character". A grapheme cluster is made from one or more code points: a base character and zero or more combining characters. An example of a combining character is an umlaut, or any of the various other decorations and modifiers you might want to add. See this answer for a horrifying example of what combining characters can do.

To test for a combining character, you can use GetUnicodeCategory (available on both char and CharUnicodeInfo) and check for UnicodeCategory.EnclosingMark, UnicodeCategory.NonSpacingMark, or UnicodeCategory.SpacingCombiningMark.
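A minimal sketch of both ideas follows: checking a code point's category, and letting the framework group code points into text elements via StringInfo. GraphemeHelpers and its method names are hypothetical. As far as I know, how closely StringInfo's text elements match full grapheme clusters depends on the .NET version (newer runtimes follow the Unicode segmentation rules more closely), so treat the splitting as an approximation on older frameworks.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper class; the Globalization calls are real .NET APIs.
public static class GraphemeHelpers
{
    // True if the code point starting at idx is a combining mark.
    public static bool IsCombiningMark(string s, int idx)
    {
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(s, idx);
        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.EnclosingMark;
    }

    // Splits s into text elements: a base character together with any
    // combining marks that follow it, with surrogate pairs kept whole.
    public static List<string> SplitTextElements(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            elements.Add(e.GetTextElement());
        return elements;
    }
}
```

For example, "u\u0308" (u followed by a combining diaeresis) is two code points but should come back as a single text element.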

answered Oct 17 '22 06:10

Cory Nelson
