
Unexpected behavior when sorting strings with letters and dashes

Tags:

c#

sorting

If I have a list of strings that contain only digits and dashes, they sort in ascending order like so:

s = s.OrderBy(t => t).ToList();

66-0616280-000
66-0616280-100
66-06162801000
66-06162801040

This is as expected.

However, if the strings contain letters, the sort order is somewhat unexpected. For example, here is the same list of strings with a trailing A replacing the final 0 of each, and yes, it is sorted:

66-0616280-00A
66-0616280100A
66-0616280104A
66-0616280-10A

I would have expected them to sort like so:

66-0616280-00A
66-0616280-10A
66-0616280100A
66-0616280104A

Why does the sort behave differently on the string when it contains letters vs. when it contains only numbers?

Thanks in advance.

BBauer42 asked Feb 19 '14 16:02



1 Answer

It's because the default string comparer is culture-sensitive. As far as I can tell, Comparer<string>.Default delegates to string.CompareTo(string), which uses the current culture:

This method performs a word (case-sensitive and culture-sensitive) comparison using the current culture. For more information about word, string, and ordinal sorts, see System.Globalization.CompareOptions.

Then the page for CompareOptions includes:

The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them. For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases. Therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.

("Small weight" isn't quite the same as "ignored" as quoted in Andrei's answer, but the effects are similar here.)
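To make the three kinds of comparison concrete, here is a minimal sketch that compares two of the question's strings each way. The class name CompareKindsDemo and the helper OrdinalSign are hypothetical, and using CultureInfo.InvariantCulture is my assumption for reproducibility (the question's code uses the current culture). Only the ordinal result is fixed by code point values; the word sort and string sort results depend on the culture data the runtime uses, so no output is claimed for them.

```csharp
using System;
using System.Globalization;

public static class CompareKindsDemo
{
    // Ordinal comparison looks only at UTF-16 code unit values:
    // '-' is U+002D and '1' is U+0031, so the dashed string comes first.
    public static int OrdinalSign(string a, string b) =>
        Math.Sign(string.CompareOrdinal(a, b));

    public static void Main()
    {
        const string dashed = "66-0616280-10A";
        const string plain = "66-0616280104A";

        // Assumption: invariant culture, chosen only to keep the demo stable.
        CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;

        // Word sort (the default): the hyphen may carry a very small weight.
        Console.WriteLine(ci.Compare(dashed, plain, CompareOptions.None));

        // String sort: nonalphanumeric symbols come before alphanumerics.
        Console.WriteLine(ci.Compare(dashed, plain, CompareOptions.StringSort));

        // Ordinal sort: prints -1, because '-' (45) is less than '1' (49).
        Console.WriteLine(OrdinalSign(dashed, plain));
    }
}
```

If the first two lines print different signs on your machine, that is the word-sort treatment of the hyphen showing through.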

If you specify StringComparer.Ordinal, you get these results:

66-0616280-00A
66-0616280-10A
66-0616280100A
66-0616280104A

Specify it as the second argument to OrderBy:

s = s.OrderBy(t => t, StringComparer.Ordinal).ToList();

You can see the difference here:

Console.WriteLine(Comparer<string>.Default.Compare
    ("66-0616280104A", "66-0616280-10A"));
Console.WriteLine(StringComparer.Ordinal.Compare
    ("66-0616280104A", "66-0616280-10A"));
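For completeness, here is a small self-contained sketch that applies the ordinal comparer to the list from the question. The class name OrdinalSortDemo and the helper SortOrdinal are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrdinalSortDemo
{
    // Sorts by raw character values, ignoring culture-specific weights.
    public static List<string> SortOrdinal(IEnumerable<string> items) =>
        items.OrderBy(t => t, StringComparer.Ordinal).ToList();

    public static void Main()
    {
        // The list from the question, in the order the default sort produced.
        var s = new List<string>
        {
            "66-0616280-00A",
            "66-0616280100A",
            "66-0616280104A",
            "66-0616280-10A",
        };

        // Prints the dashed entries first, then 100A, then 104A.
        foreach (var item in SortOrdinal(s))
        {
            Console.WriteLine(item);
        }
    }
}
```

If you don't need a new list, s.Sort(StringComparer.Ordinal) should sort the existing List<string> in place with the same comparer.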
Jon Skeet answered Sep 21 '22 10:09