
Why do C# System.Char methods for Unicode property tests have two overloads?

Tags:

c#

char

unicode

In the methods of System.Char, we see two methods for checking if a character is a symbol:

public static bool IsSymbol(string s, int index)
public static bool IsSymbol(char c)

and likewise for other property tests: IsLower, IsLetter, etc.

Why is there this duplication? Is there any reason to prefer Char.IsSymbol(s, idx) over Char.IsSymbol(s[idx])?

ridiculous_fish asked Jun 08 '16 19:06


1 Answer

On the surface both overloads appear to be functionally the same. However, drilling down to the call to InternalGetUnicodeCategory reveals that they end up calling different overloads of CharUnicodeInfo.GetUnicodeCategory.

The (string, int) overload runs through a UTF-32 conversion via InternalConvertToUtf32 before calling the same single-char InternalGetUnicodeCategory function. This accounts for the possibility that the character at that index is encoded as a surrogate pair in the UTF-16 string.

    internal static UnicodeCategory InternalGetUnicodeCategory(String value, int index) {
        Contract.Assert(value != null, "value can not be null");
        Contract.Assert(index < value.Length, "index < value.Length");

        return (InternalGetUnicodeCategory(InternalConvertToUtf32(value, index)));
    }

Check out the Conversion implementation here if you want.
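
To make the conversion step concrete, here is a minimal sketch of what decoding a surrogate pair amounts to. ToCodePoint is a hypothetical helper written for illustration, not the framework's InternalConvertToUtf32; it is compared against the public char.ConvertToUtf32(string, int), and the public CharUnicodeInfo.GetUnicodeCategory overloads are used to show how the two paths classify the same input.

```csharp
using System;
using System.Globalization;

static class SurrogateDecodeSketch
{
    // Hypothetical helper (not framework code): returns the code point that
    // starts at s[index], combining a well-formed surrogate pair if present.
    public static int ToCodePoint(string s, int index)
    {
        char hi = s[index];
        if (char.IsHighSurrogate(hi) && index + 1 < s.Length && char.IsLowSurrogate(s[index + 1]))
        {
            char lo = s[index + 1];
            // Standard UTF-16 arithmetic: 10 bits from each half, offset by 0x10000.
            return ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;
        }
        return hi; // a lone code unit is returned unchanged
    }

    static void Main()
    {
        string s = char.ConvertFromUtf32(0x1F4CC); // the pushpin, stored as two chars

        Console.WriteLine(ToCodePoint(s, 0) == char.ConvertToUtf32(s, 0)); // True
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s, 0));       // OtherSymbol
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s[0]));       // Surrogate
    }
}
```

The last two lines are the whole story: the string overload sees the decoded code point, while the char overload only ever sees half of it.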

Why does this matter, you may ask? The answer is that .NET supports text elements. Microsoft states:

MSDN Documentation on Unicode Support for Surrogate Pairs

A text element is a unit of text that is displayed as a single character, called a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence.
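
As a rough illustration of that definition, the sketch below uses System.Globalization.StringInfo to count text elements in a string that contains both a combining character sequence and a surrogate pair. CountTextElements is a hypothetical wrapper name, and the sample string is my own choice.

```csharp
using System;
using System.Globalization;

static class TextElementSketch
{
    // Hypothetical wrapper around StringInfo, for illustration only.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    static void Main()
    {
        // "e" + U+0301 (combining acute accent), then U+1F4CC (a surrogate pair).
        string s = "e\u0301" + char.ConvertFromUtf32(0x1F4CC);

        Console.WriteLine(s.Length);             // 4 UTF-16 code units
        Console.WriteLine(CountTextElements(s)); // 2 text elements

        // Walk the text elements and report where each one starts.
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            string element = e.GetTextElement();
            Console.WriteLine($"index {e.ElementIndex}: {element.Length} code unit(s)");
        }
    }
}
```

Both elements here occupy two chars each, which is exactly why indexing a string by char does not line up with what a reader would call a character.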

While I do not believe the IsSymbol function and its relatives can decode graphemes or combining character sequences, text elements are relevant here because one can be a surrogate pair, and a surrogate pair has to be decoded via the (string, int) overload of IsSymbol(), IsLetter(), etc.

What this means is that passing one half of a surrogate pair to the char overload returns the wrong result, because a char holds only a single 16-bit code unit. You cannot assume that one 16-bit code unit represents a whole character, and passing the string's char at a given index makes exactly that assumption.

Because surrogate pairs can appear in a .NET string, it stands to reason that if you are dealing with a string that could contain one, the IsSymbol(string s, int index) overload is the more appropriate choice, since it covers the case where such a pair is present.

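A specific example where the results differ:

string s = char.ConvertFromUtf32(128204); // "📌" (U+1F4CC), stored as a surrogate pair

// s[0] is only the high surrogate, so the char overload returns false,
// while the (string, int) overload decodes the pair and returns true.
Debug.Assert(char.IsSymbol(s[0]) == char.IsSymbol(s, 0)); // Fails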
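
If you want to apply this to a whole string, something along these lines should work. It is only a sketch: CountSymbols and CountSymbolsPerChar are hypothetical helper names, and the loop assumes that stepping over the low half of each well-formed pair is the behaviour you want.

```csharp
using System;

static class SymbolCounter
{
    // Hypothetical helper: counts symbols by code point, using the
    // (string, int) overload so surrogate pairs are decoded.
    public static int CountSymbols(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSymbol(s, i)) count++;
            // Step over the low surrogate so the pair is tested only once.
            if (char.IsSurrogatePair(s, i)) i++;
        }
        return count;
    }

    // Hypothetical helper: the naive per-char version, for comparison.
    public static int CountSymbolsPerChar(string s)
    {
        int count = 0;
        foreach (char c in s)
        {
            if (char.IsSymbol(c)) count++;
        }
        return count;
    }

    static void Main()
    {
        string s = "a+" + char.ConvertFromUtf32(0x1F4CC);

        Console.WriteLine(CountSymbols(s));        // 2: '+' and the pushpin
        Console.WriteLine(CountSymbolsPerChar(s)); // 1: only '+'
    }
}
```

For strings that stay inside the Basic Multilingual Plane the two helpers agree, so the string overload costs you nothing in correctness.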

Evan L answered Nov 15 '22 19:11