I'm trying to handle the following character: ⨝ (http://www.fileformat.info/info/unicode/char/2a1d/index.htm)
If you check whether an empty string starts with this character, it always returns true, which does not make any sense! Why is that?
// Visual Studio 2008 hides lines that contain this char literally (bug in Visual Studio?!?), so I wrote its Unicode code point instead.
char specialChar = (char)10781;
string specialString = specialChar.ToString();
// prints 1
Console.WriteLine(specialString.Length);
// prints 10781
Console.WriteLine((int)specialChar);
// prints false
Console.WriteLine(string.Empty.StartsWith("A"));
// both print true, WTF?!?
Console.WriteLine(string.Empty.StartsWith(specialString));
Console.WriteLine(string.Empty.StartsWith(((char)10781).ToString()));
You can work around this by passing an ordinal StringComparison (code below).
From the MSDN docs:
When you specify either StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase, the string comparison will be non-linguistic. That is, the features that are specific to the natural language are ignored when making comparison decisions. This means the decisions are based on simple byte comparisons and ignore casing or equivalence tables that are parameterized by culture. As a result, by explicitly setting the parameter to either the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase, your code often gains speed, increases correctness, and becomes more reliable.
char specialChar = (char)10781;
string specialString = Convert.ToString(specialChar);
// prints 1
Console.WriteLine(specialString.Length);
// prints 10781
Console.WriteLine((int)specialChar);
// prints false
Console.WriteLine(string.Empty.StartsWith("A"));
// prints false
Console.WriteLine(string.Empty.StartsWith(specialString, StringComparison.Ordinal));
Nice unicode glitch ;-p
I'm not sure why it does this, but amusingly:
Console.WriteLine(string.Empty.StartsWith(specialString)); // true
Console.WriteLine(string.Empty.Contains(specialString)); // false
Console.WriteLine("abc".StartsWith(specialString)); // true
Console.WriteLine("abc".Contains(specialString)); // false
I'm guessing this is treated a bit like the non-joining character that Jon mentioned at devdays; some string functions see it, and some don't. And if it doesn't see it, this becomes "does (some string) start with an empty string", which is always true.
The underlying reason for this is that the default string comparison is locale-aware. This means it uses tables of locale data for comparisons (including equality).
Many (if not most) Unicode characters have no weight in those tables for many locales, so for comparison purposes they effectively don't exist (or they do, but match anything, or nothing).
See the entries on character weights on Michael Kaplan's blog "Sorting It All Out". This series of blog posts contains a lot of background information (the APIs are native but, as I understand it, the mechanisms in .NET are the same).
Quick version: this is a complex area. Getting expected (natural-language) comparisons right is hard, and this tends to lead to odd results with code points for glyphs outside your language.
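To make the difference concrete, here is a minimal sketch that puts the culture-sensitive and ordinal checks side by side. The class name PrefixDemo and the helper StartsWithOrdinal are made up for illustration. I have left out expected output for the culture-sensitive lines on purpose: what they print depends on the runtime and the collation data it uses, so treat them as something to try rather than a guarantee.

```csharp
using System;
using System.Globalization;

static class PrefixDemo
{
    // Hypothetical helper (not a framework API): prefix test that compares
    // UTF-16 code units and never consults culture collation tables.
    public static bool StartsWithOrdinal(string text, string prefix)
    {
        return text.StartsWith(prefix, StringComparison.Ordinal);
    }

    static void Main()
    {
        string specialString = ((char)10781).ToString(); // U+2A1D

        // Culture-sensitive checks. Results depend on the runtime and its
        // collation data, so no expected output is given here.
        Console.WriteLine(string.Empty.StartsWith(specialString, StringComparison.CurrentCulture));
        Console.WriteLine(CultureInfo.InvariantCulture.CompareInfo.IsPrefix(
            string.Empty, specialString, CompareOptions.None));

        // A result of 0 here would mean the comparer treats the character
        // as equal to the empty string, i.e. it carries no weight.
        Console.WriteLine(string.Compare(specialString, string.Empty, StringComparison.CurrentCulture));

        // Ordinal checks: these do not vary by culture.
        Console.WriteLine(StartsWithOrdinal(string.Empty, specialString));           // False
        Console.WriteLine(StartsWithOrdinal(specialString + "abc", specialString));  // True
    }
}
```

If the string.Compare line prints 0 on your machine, that matches the explanation above: the character has no weight, so the question collapses to "does this string start with an empty string", which is always true.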