
Weird string sorting when 2nd string is longer

Tags: string, c#, sorting

Comparing "î"

string.Compare("î", "I ", StringComparison.CurrentCulture);           // returns -1
string.Compare("î", "I ", StringComparison.CurrentCultureIgnoreCase); // returns -1
string.Compare("î", "I", StringComparison.CurrentCulture);            // returns 1 (unexpected)
string.Compare("î", "I", StringComparison.CurrentCultureIgnoreCase);  // returns 1 (unexpected)

With "i"

string.Compare("i", "I ", StringComparison.CurrentCulture);           // returns -1
string.Compare("i", "I ", StringComparison.CurrentCultureIgnoreCase); // returns -1
string.Compare("i", "I", StringComparison.CurrentCulture);            // returns -1
string.Compare("i", "I", StringComparison.CurrentCultureIgnoreCase);  // returns 0

The current culture was en-GB. I would expect all of these to return 1. Why does making the second string longer change the sort order?

Jon Rea asked May 17 '13 11:05



2 Answers

See UTS #10: Unicode Collation Algorithm for the full details.

In particular, see section 1.1, Multi-Level Comparison, which explains this behaviour.

There's a table there showing some examples, such as this one:

role < rôle < roles

That is analogous to your example with "I", "î" and "I ", i.e.:

"I" < "î" < "I "

The only difference is that where roles has an s at the end, your example has a space. The same logic applies: it is irrelevant what the extra character is. The simple fact that there is an extra character makes that string sort after the "î".

A crucial point from the spec is:

Accent differences are typically ignored, if the base letters differ.

When the lengths differ, one string has an extra character, so the two strings already differ at the base-letter level. The accent differences are therefore ignored in your examples with the space at the end.

However, where the strings are the same length, the base letters are equal and the accent differences are not ignored, which is exactly the result you are seeing.
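
The base-letters-first, accents-second ordering described above can be sketched with a toy two-level comparer. This is only an illustration under simplifying assumptions: it is not the Unicode Collation Algorithm and not what .NET does internally, it ignores case entirely, and the names TwoLevelCompare and BaseLetters are made up for this example.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Toy model only: mimics "compare base letters first, accents second".
public static class TwoLevelCompare
{
    // Level-1 key: decompose, drop combining accents, and upper-case (case is ignored here).
    static string BaseLetters(string s) =>
        new string(s.Normalize(NormalizationForm.FormD)
                    .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                    .ToArray())
            .ToUpperInvariant();

    public static int Compare(string a, string b)
    {
        // Level 1: base letters only. An extra trailing character decides the result here.
        int primary = Math.Sign(string.CompareOrdinal(BaseLetters(a), BaseLetters(b)));
        if (primary != 0) return primary;

        // Level 2: reached only when the base letters are identical, so accents break the tie.
        return Math.Sign(string.CompareOrdinal(
            a.Normalize(NormalizationForm.FormD).ToUpperInvariant(),
            b.Normalize(NormalizationForm.FormD).ToUpperInvariant()));
    }

    public static void Main()
    {
        Console.WriteLine(Compare("î", "I ")); // -1: decided at level 1 by the extra space
        Console.WriteLine(Compare("î", "I"));  //  1: base letters equal, accent decides
        Console.WriteLine(Compare("i", "I"));  //  0: this toy model ignores case
    }
}
```

With the extra space, the comparison never reaches the accent level, which mirrors the -1 results in the question; without it, the accent is the only difference left and "î" sorts after "I".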

Matthew Watson answered Nov 12 '22 05:11



From the documentation:

The comparison terminates when an inequality is discovered or both strings have been compared. However, if the two strings compare equal to the end of one string, and the other string has characters remaining, then the string with remaining characters is considered greater. The return value is the result of the last comparison performed.
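
The "string with remaining characters is considered greater" rule from the quoted passage can be seen in isolation with an ordinal comparison. Ordinal comparison is used here only because its result does not depend on the current culture; the class and method names are made up for this sketch, and it does not by itself account for the accent behaviour in the question.

```csharp
using System;

public static class PrefixRuleDemo
{
    // Normalises the ordinal comparison result to -1, 0 or 1.
    public static int OrdinalSign(string a, string b) =>
        Math.Sign(string.CompareOrdinal(a, b));

    public static void Main()
    {
        Console.WriteLine(OrdinalSign("I", "I "));  // -1: "I " has a character remaining
        Console.WriteLine(OrdinalSign("I ", "I"));  //  1: same rule, arguments swapped
        Console.WriteLine(OrdinalSign("I", "I"));   //  0: identical strings
    }
}
```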

bastos.sergio answered Nov 12 '22 05:11
