Why does "\u1FFF:foo".StartsWith(":") return true?

Tags: c#, .net, clr, f#

The string "\u1FFF:foo" starts with \u1FFF (or "῿"), right?

So how can these both be true?

"\u1FFF:foo".StartsWith(":")       // equals true
"\u1FFF:foo".StartsWith("\u1FFF")  // equals true

// alternatively, the same:
"῿:foo".StartsWith(":")           // equals true
"῿:foo".StartsWith("῿")          // equals true

Does .NET claim that this string starts with two different characters?

I find this very surprising and would like to understand why it happens. I'm equally interested in how I can force .NET to compare strictly by code points instead; using InvariantCulture doesn't seem to change anything.

For comparison, one character below that, "\u1FFE:foo".StartsWith(":") returns false.

Abel asked Nov 09 '17 19:11


2 Answers

It is not surprising that a string can be considered to start with two different strings that are not byte-for-byte identical, because Unicode is complicated. For example, these results are almost always going to reflect what a user wants:

"n\u0303".StartsWith("\u00f1") // true
"n\u0303".StartsWith("n")      // false

Using System.Globalization.CharUnicodeInfo.GetUnicodeCategory, you can see that '\u1fff' is in the "OtherNotAssigned" category; it's unclear to me whether that should affect string search/sort/comparison operations (it does not appear to affect normalization, that is, the characters remain after normalization).
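A minimal sketch of that category check, assuming a plain console program. The class and method names (CategoryCheck, CategoryOf) are made up for illustration and are not part of any API:

```csharp
using System;
using System.Globalization;

public static class CategoryCheck
{
    // Hypothetical helper: returns the Unicode general category
    // that .NET reports for a single UTF-16 char.
    public static UnicodeCategory CategoryOf(char c) =>
        CharUnicodeInfo.GetUnicodeCategory(c);

    public static void Main()
    {
        // U+1FFF has no character assigned in the Greek Extended block.
        Console.WriteLine(CategoryOf('\u1FFF')); // OtherNotAssigned

        // U+1FFE, one code point below, is an assigned character.
        Console.WriteLine(CategoryOf('\u1FFE'));
    }
}
```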

If you want a byte-for-byte comparison (one that compares the UTF-16 code units directly and applies no linguistic rules), use StringComparison.Ordinal.
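Applied to the string from the question, an ordinal comparison could look like the sketch below. StartsWithOrdinal is a hypothetical wrapper added only for illustration; the StartsWith(String, StringComparison) overload it calls is the real API:

```csharp
using System;

public static class OrdinalCheck
{
    // Hypothetical wrapper around the StartsWith(String, StringComparison) overload.
    public static bool StartsWithOrdinal(string text, string prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    public static void Main()
    {
        string s = "\u1FFF:foo";

        // Ordinal compares the UTF-16 code units one by one,
        // so the unassigned code point is no longer skipped.
        Console.WriteLine(StartsWithOrdinal(s, ":"));      // False
        Console.WriteLine(StartsWithOrdinal(s, "\u1FFF")); // True
    }
}
```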

kvb answered Sep 19 '22 19:09


Because you are using String.StartsWith() incorrectly. You should use the String.StartsWith(String, StringComparison) overload with StringComparison.Ordinal.

There is no character assigned to \u1FFF, that is, no linguistic meaning is attached to this code point. See the Greek Extended (Range: 1F00–1FFF) excerpt from the character code tables of the Unicode Standard. The Best Practices for Using Strings in .NET document on MSDN explicitly states that if you need to compare strings in a manner that ignores features of natural languages, you should use StringComparison.Ordinal:

Specifying the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase value in a method call signifies a non-linguistic comparison in which the features of natural languages are ignored. Methods that are invoked with these StringComparison values base string operation decisions on simple byte comparisons instead of casing or equivalence tables that are parameterized by culture. In most cases, this approach best fits the intended interpretation of strings while making code faster and more reliable.

Moreover, it recommends always specifying StringComparison explicitly in such method calls:

When you develop with .NET, follow these simple recommendations when you use strings:

  • Use overloads that explicitly specify the string comparison rules for string operations. Typically, this involves calling a method overload that has a parameter of type StringComparison.
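
As a rough illustration of that recommendation beyond StartsWith, the sketch below passes an explicit StringComparison to a few other string operations. The class name and the IndexOfOrdinal helper are hypothetical; the overloads they call exist in .NET:

```csharp
using System;

public static class ExplicitComparison
{
    // Hypothetical helper: finds a substring by comparing UTF-16 code units only.
    public static int IndexOfOrdinal(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    public static void Main()
    {
        string s = "\u1FFF:foo";

        // Each call names its comparison rule instead of relying on a default.
        Console.WriteLine(IndexOfOrdinal(s, ":"));                                  // 1
        Console.WriteLine(s.EndsWith("FOO", StringComparison.OrdinalIgnoreCase));   // True
        Console.WriteLine(string.Equals("foo", "FOO", StringComparison.Ordinal));   // False
        Console.WriteLine(string.Compare("a", "B", StringComparison.Ordinal) > 0);  // True
    }
}
```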
Leonid Vasilev answered Sep 23 '22 19:09