In the methods of System.Char, we see two methods for checking if a character is a symbol:
public static bool IsSymbol(string s, int index)
public static bool IsSymbol(char c)
and likewise for other property tests: IsLower, IsLetter, etc.
Why is there this duplication? Is there any reason to prefer Char.IsSymbol(s, idx) over Char.IsSymbol(s[idx])?
On the surface both overloads appear to be functionally the same. However, drilling down to the call to InternalGetUnicodeCategory reveals that they result in calls to different overloads of CharUnicodeInfo.GetUnicodeCategory.
The string,int overload runs through a UTF-32 conversion via InternalConvertToUtf32 before calling the same InternalGetUnicodeCategory function that the single-char path ends up in. This lets it decode a surrogate pair, that is, a character stored as two UTF-16 code units in the string.
internal static UnicodeCategory InternalGetUnicodeCategory(String value, int index) {
    Contract.Assert(value != null, "value can not be null");
    Contract.Assert(index < value.Length, "index < value.Length");
    return (InternalGetUnicodeCategory(InternalConvertToUtf32(value, index)));
}
Check out the Conversion implementation here if you want.
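To make the conversion step concrete, here is a minimal sketch of how a surrogate pair maps to a single code point. ToCodePoint is a hypothetical helper written for illustration, not the framework's internal code; it uses only the public char.IsHighSurrogate and char.IsLowSurrogate methods, and it assumes that a lone surrogate should simply be returned as its raw code unit (the public char.ConvertToUtf32 throws in that case instead).

```csharp
using System;

static class SurrogateSketch
{
    // Hypothetical helper: decodes the code point that starts at s[index].
    // Assumption: a lone surrogate is returned unchanged rather than rejected.
    public static int ToCodePoint(string s, int index)
    {
        char first = s[index];
        if (char.IsHighSurrogate(first)
            && index + 1 < s.Length
            && char.IsLowSurrogate(s[index + 1]))
        {
            // Each surrogate carries 10 bits of the value above U+FFFF.
            return ((first - 0xD800) << 10) + (s[index + 1] - 0xDC00) + 0x10000;
        }
        return first;
    }
}
```

For well-formed input this should agree with char.ConvertToUtf32(s, index), which is the public counterpart of the conversion described above.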
Why does this matter? Because .NET supports text elements. The MSDN documentation on Unicode Support for Surrogate Pairs states:
A text element is a unit of text that is displayed as a single character, called a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence.
While I do not believe IsSymbol and its relatives can decode graphemes or combining character sequences, text elements are relevant here because one can be a surrogate pair, and a surrogate pair has to be decoded via the string,int overload of IsSymbol(), IsLetter(), etc.
This means that passing one half of a surrogate pair to the char overload returns the wrong result, because a lone surrogate is not the character it helps encode. You cannot assume that a single 16-bit code unit represents a whole character, and passing the string's char at a given index makes exactly that assumption.
Because strings in .NET can contain surrogate pairs, it stands to reason that if you are dealing with a string that might contain one, the IsSymbol(string s, int index) overload is the more appropriate choice.
A specific example where the results differ:
string s = char.ConvertFromUtf32(128204); // "📌" (U+1F4CC), stored as two UTF-16 code units
Debug.Assert(char.IsSymbol(s[0]) == char.IsSymbol(s, 0)); // Fails: s[0] is only the high surrogate
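Building on that example, the following is a rough sketch of how one might scan a whole string with the string,int overload so that surrogate pairs are tested once, as a unit. CountSymbols and CountSymbolsNaive are hypothetical names introduced here for illustration; the only framework calls used are char.IsSymbol and char.IsSurrogatePair.

```csharp
using System;

static class SymbolCounting
{
    // Surrogate-aware: tests each code point once via the string,int overload.
    public static int CountSymbols(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSymbol(s, i)) count++;
            // Skip the low surrogate so the pair is not examined twice.
            if (char.IsSurrogatePair(s, i)) i++;
        }
        return count;
    }

    // Naive: tests each UTF-16 code unit on its own via the char overload.
    public static int CountSymbolsNaive(string s)
    {
        int count = 0;
        foreach (char c in s)
        {
            if (char.IsSymbol(c)) count++;
        }
        return count;
    }
}
```

For a string such as "a+" followed by the pushpin character, the surrogate-aware version counts both the plus sign and the pushpin, while the naive version misses the pushpin because neither of its surrogate halves is a symbol on its own.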