 

Linq OrderBy on generic list returns not entirely alphabetical list

Tags:

c#

linq

I am trying to sort a generic list of objects by each object's Name property. I am using LINQ, and the following expression doesn't quite work:

var query = possibleWords.OrderBy(x => x.Name.ToLower()).ToList();
foreach (Word word in query) //possibleWords.OrderBy(word => word.Name))
{
    listWords.Items.Add(word.Name);
}

If I understand it correctly, "query" should now contain the ordered items, and each item should be added to the listbox named listWords.

However the output is this:

http://screencast.com/t/s1CkkWfXD4 (sorry for the URL link, but SO has somehow locked me out of my account and I apparently can't post images with this new one).

The listbox is almost alphabetical but not quite. For some reason "aa" and "aaaa" come last. What could be the reason, and how can I resolve it?

Thanks in advance.

Elaboration by request

This code, when entered in Visual Studio and executed:

        List<Word> words = new List<Word>();

        words.Add(new Word("a"));
        words.Add(new Word("Calculator"));
        words.Add(new Word("aaa"));
        words.Add(new Word("Projects"));
        words.Add(new Word("aa"));
        words.Add(new Word("bb"));
        words.Add(new Word("c"));

        IEnumerable<Word> query = words.OrderBy(x => x.Name.ToLower()).ToList();

        foreach (Word word in query)
        {
            Console.WriteLine(word.Name);
        }

Gives me the following output:

a
bb
c
Calculator
Projects
aa
aaa

This is not sorted correctly: The first "a" is correct, but the subsequent "aa" and "aaa" entries are sent to the bottom of the list.

I'm not too knowledgeable about character sets and encoding, so possibly I am making a rookie mistake here. But in that case I do not recognize what it might be, and I am puzzled as to why the first "a" is ordered correctly while "aa" and "aaa" are not!

Further elaboration - Word class

[Serializable()]
public class Word
{
    [System.Xml.Serialization.XmlAttribute("Name")]
    public string Name { get; set; }

    public Word(string name)
    {
        Name = name;
    }

    public Word() { } // Parameterless constructor necessary for serialization

}

Cause and resolution

As @Douglas suggested, the problem was resolved by supplying the StringComparer.InvariantCultureIgnoreCase comparer to the OrderBy method.

On further research, it seems that both the FindAll and OrderBy methods (and possibly others) give unexpected results under the Danish culture (da-DK). Other methods or cultures may be affected as well, but FindAll and OrderBy under da-DK definitely do not work as I intended.

The OrderBy method has the problem described in this thread (wrong ordering). The FindAll method has a similar, very strange problem: assume we have a list of entries: a, aa, aaa and aaaa. When using FindAll(x => x.StartsWith("a")), it will only return "a", NOT aa, aaa and aaaa. When using StartsWith("aa"), it will correctly find aa, as well as aaa and aaaa. When using StartsWith("aaa"), it will again not find aaaa, only aaa! This seems to be a bug in the framework.
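The FindAll results described above are consistent with culture-sensitive matching: the single-argument string.StartsWith overload compares using the current culture, so under da-DK a pair of a's can be treated as one letter. A minimal sketch of one way around it, assuming plain character-by-character matching is what is wanted, is to pass StringComparison.Ordinal. The PrefixSearch class and FindByPrefix helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixSearch
{
    // Hypothetical helper: returns the entries that start with the given prefix.
    // StringComparison.Ordinal compares raw UTF-16 code units, so no
    // culture-specific rule (such as Danish "aa") can affect the match.
    public static List<string> FindByPrefix(List<string> entries, string prefix)
    {
        return entries.FindAll(x => x.StartsWith(prefix, StringComparison.Ordinal));
    }

    public static void Main()
    {
        var entries = new List<string> { "a", "aa", "aaa", "aaaa" };

        // Prints: a, aa, aaa, aaaa
        Console.WriteLine(string.Join(", ", FindByPrefix(entries, "a")));

        // Prints: aaa, aaaa
        Console.WriteLine(string.Join(", ", FindByPrefix(entries, "aaa")));
    }
}
```

If case should be ignored as well, StringComparison.OrdinalIgnoreCase can be passed in the same position.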

Morten Kirsbo asked Nov 16 '12 21:11


2 Answers

Could you try replacing:

IEnumerable<Word> query = words.OrderBy(x => x.Name.ToLower()).ToList();

…with:

IEnumerable<Word> query = words.OrderBy(x => x.Name, 
    StringComparer.InvariantCultureIgnoreCase);

There's a very small chance that it's a weird culture issue.

Douglas answered Sep 24 '22 03:09


The following code outputs the expected result:

class Word
{
    public Word(string str)
    {
        Name = str;
    }

    public string Name { get; private set; }
}

public static void Main(string[] args)
{
    List<Word> words = new List<Word>();

    words.Add(new Word("a"));
    words.Add(new Word("Calculator"));
    words.Add(new Word("aaa"));
    words.Add(new Word("Projects"));
    words.Add(new Word("aa"));
    words.Add(new Word("bb"));
    words.Add(new Word("c"));

    IEnumerable<Word> query = words.OrderBy(x => x.Name.ToLower()).ToList();

    foreach (Word word in query)
    {
        Console.WriteLine(word.Name);
    }
}

Outputs:

a
aa
aaa
bb
c
Calculator
Projects

Update: Ok, mystery solved (kind of). If you execute the following before your code:

var cultureInfo = new CultureInfo("da-DK");
Thread.CurrentThread.CurrentCulture = cultureInfo;
Thread.CurrentThread.CurrentUICulture = cultureInfo;

You get "incorrect" output:

a
bb
c
Calculator
Projects
aa
aaa

Apparently the rules for Danish lexicographical comparisons are different. Here's an explanation I've found on the net (http://stackoverflow.com/questions/4064633/string-comparison-in-java):

Note that this is very dependent on the active locale. For instance, here in Denmark we have a character "å" which used to be spelled as "aa" and is very distinct from two single a's. Hence Danish sorting rules treat two consecutive a's identically to an "å", which means that it goes after z. This also means that Danish dictionaries are sorted differently than English or Swedish ones.
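The difference can also be seen without changing the thread culture, by handing OrderBy an explicit comparer. The sketch below is only an illustration: SortDemo and SortWith are hypothetical names, and the da-DK result depends on the culture data available on the machine, so no output is claimed for that line.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class SortDemo
{
    // Same words as in the question.
    private static readonly string[] Names =
        { "a", "Calculator", "aaa", "Projects", "aa", "bb", "c" };

    // Hypothetical helper: sorts the names with an explicitly supplied
    // comparer; the current thread culture is never consulted or changed.
    public static List<string> SortWith(IComparer<string> comparer)
    {
        return Names.OrderBy(x => x, comparer).ToList();
    }

    public static void Main()
    {
        // Culture-independent order: a, aa, aaa, bb, c, Calculator, Projects
        Console.WriteLine(string.Join(", ",
            SortWith(StringComparer.InvariantCultureIgnoreCase)));

        try
        {
            // Danish rules; where "aa" is collated like "å", those entries move to the end.
            var danish = StringComparer.Create(CultureInfo.GetCultureInfo("da-DK"), true);
            Console.WriteLine(string.Join(", ", SortWith(danish)));
        }
        catch (CultureNotFoundException)
        {
            Console.WriteLine("da-DK culture data is not available on this system.");
        }
    }
}
```

Passing the comparer explicitly keeps the choice visible at the call site, which may be preferable to relying on whatever culture the thread happens to run under.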

Grozz answered Sep 25 '22 03:09