I'm seeing some very strange sorting behaviour using CaseInsensitiveComparer.DefaultInvariant. Words that start with a hyphen "-" end up sorted as if the hyphen wasn't there, rather than being sorted in front of the actual letters, which is what happens with other punctuation.
So given { "hello", ".net", "-less"} I end up with {".net", "hello", "-less" } instead of the expected {"-less", ".net", "hello"}.
Or, phrased as a test case:
[TestMethod]
public void TestMethod1()
{
    var rg = new String[] {
        "x", "z", "y", "-less", ".net", "- more", "a", "b"
    };
    Array.Sort(rg, CaseInsensitiveComparer.DefaultInvariant);
    Assert.AreEqual(
        "- more,-less,.net,a,b,x,y,z",
        String.Join(",", rg)
    );
}
... which fails like this:
Assert.AreEqual failed.
Expected:<- more,-less,.net,a,b,x,y,z>.
Actual: <- more,.net,a,b,-less,x,y,z>.
Any ideas what's going on?
It looks like .NET, by default, does fancy things when sorting strings: hyphens are given very little weight so that co-op and coop sort together, which is why leading hyphens end up in strange places. Thus, if you want your leading-hyphen words to end up at the beginning with the other punctuation, you have to tell it not to do that:
Array.Sort(rg, (a, b) => String.CompareOrdinal(a, b));
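One thing to watch with the line above: String.CompareOrdinal is case-sensitive, while the comparer in the question was case-insensitive. A minimal sketch of a case-insensitive ordinal sort, assuming the built-in StringComparer.OrdinalIgnoreCase is acceptable for your data (the class and method names here are made up for illustration, and the uppercase "B" was added to the sample to show the difference):

```csharp
using System;

public static class OrdinalSortDemo
{
    // Hypothetical helper: sorts a copy of the input by code point, ignoring case,
    // and returns the result joined with commas.
    public static string SortedJoin(string[] words)
    {
        var copy = (string[])words.Clone();
        Array.Sort(copy, StringComparer.OrdinalIgnoreCase);
        return String.Join(",", copy);
    }

    public static void Main()
    {
        var rg = new String[] { "x", "z", "y", "-less", ".net", "- more", "a", "B" };
        // prints "- more,-less,.net,a,B,x,y,z"
        Console.WriteLine(SortedJoin(rg));
    }
}
```

With plain String.CompareOrdinal the "B" would sort ahead of "a", because uppercase letters have lower code points than lowercase ones.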
Comparison procedures use the CultureInfo.InvariantCulture to determine the sort order and casing rules. String comparisons might have different results depending on the culture. For more information on culture-specific comparisons, see the System.Globalization namespace and Encoding and Localization. From here.
The interesting part:
A word sort performs a culture-sensitive comparison of strings in which certain nonalphanumeric Unicode characters might have special weights assigned to them. For example, the hyphen (-) might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. From here.
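If you still want a culture-aware comparison but without the special hyphen handling, CompareOptions.StringSort is the documented counterpart to the word sort described above. A rough sketch, assuming the invariant culture and case-insensitivity are what you want (StringSortComparer is a made-up name, and the exact ordering of symbols can depend on the runtime and its globalization library, so test it against your own data):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical comparer: culture-aware comparison using the "string sort"
// algorithm instead of the default "word sort".
public class StringSortComparer : IComparer<string>
{
    private static readonly CompareInfo Invariant =
        CultureInfo.InvariantCulture.CompareInfo;

    public int Compare(string x, string y)
    {
        // StringSort asks for hyphens and apostrophes to be treated like
        // other nonalphanumeric symbols; IgnoreCase keeps the comparison
        // case-insensitive.
        return Invariant.Compare(
            x, y, CompareOptions.StringSort | CompareOptions.IgnoreCase);
    }
}
```

It would be used the same way as any other comparer, for example Array.Sort(rg, new StringSortComparer()).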
To sort the strings in the way you need, you have to create a comparer class that compares strings using the CompareInfo class. This class allows you to specify various comparison options; the one that best matches your needs is OrdinalIgnoreCase.
From MSDN:
Ignored Search Values
Comparison operations, such as those performed by the IndexOf or LastIndexOf methods, can yield unexpected results if the value to search for is ignored. The search value is ignored if it is an empty string (""), a character or string consisting of characters having code points that are not considered in the operation because of comparison options, or a value with code points that have no linguistic significance. If the search value for the IndexOf method is an empty string, for example, the return value is zero.
Note
When possible, the application should use string comparison methods that accept a CompareOptions value to specify the kind of comparison expected. As a general rule, user-facing comparisons are best served by the use of linguistic options (using the current culture), while security comparisons should specify Ordinal or OrdinalIgnoreCase.
I have modified your test case, and this one executes correctly:
using System;
using System.Collections.Generic;
using System.Globalization;
using NUnit.Framework;

// Compares strings by code point, ignoring case, so punctuation such as
// a leading hyphen is never skipped.
public class MyComparer : Comparer<string>
{
    private readonly CompareInfo compareInfo;

    public MyComparer()
    {
        compareInfo = CompareInfo.GetCompareInfo(CultureInfo.InvariantCulture.Name);
    }

    public override int Compare(string x, string y)
    {
        return compareInfo.Compare(x, y, CompareOptions.OrdinalIgnoreCase);
    }
}

public class Class1
{
    [Test]
    public void TestMethod1()
    {
        var rg = new String[] {
            "x", "z", "y", "-less", ".net", "- more", "a", "b"
        };
        Array.Sort(rg, new MyComparer());
        Assert.AreEqual(
            "- more,-less,.net,a,b,x,y,z",
            String.Join(",", rg)
        );
    }
}
My guess would be that a dash immediately before a letter is being ignored for purposes of sorting. When you sort a list of words, you'd like "inter-nation" and "international" to be next to each other, wouldn't you? A dash by itself, on the other hand, is considered significant.