Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LINQ GroupBy extremely slow

Tags:

c#

linq

mono

The following code takes 5 minutes to run on 100,000 rows. That seems crazy to me. What am I doing wrong?

        var query =
            from foo in fooStuff.AsEnumerable()
            group foo by foo.Field<Int64>("FooID") into g
            select new
            {
                    FooID = g.Key,
                    FooTier = g.Min(foo => foo.Field<int>("Tier"))
            };

Note: On Mono.

like image 844
Ben B. Avatar asked Aug 17 '11 17:08

Ben B.


2 Answers

You are materializing all the entities when you call AsEnumerable(), so your grouping is being done in memory. Try removing that part so that the grouping is done at the database level:

var query =
        from foo in fooStuff
        group foo by foo.FooID into g
        select new
        {
                FooID = g.Key,
                FooTier = g.Min(foo => foo.Tier)
        };
like image 85
alf Avatar answered Nov 08 '22 09:11

alf


It is not a direct comparision and isn't on Mono, but I have some code which does something similar with a 6MB xml file which I read into a DataSet and it has 30,000 rows and takes 0.5 seconds, so I don't think it is the groupby itself that causes the problem.

To diagnose further, I would suggest

  • Testing how long it takes to read the information into a list, ie

    var fooList = fooStuff.AsEnumerable().ToList(); 
    
  • Testing how long it takes if you change the query to use fooList instead of fooStuff

  • Testing how long it takes if you remove FooTier = g.Min(foo => foo.Tier) from the select

  • Separate the .Field<> reflection from the groupby and time each section, ie first read the information from the datatable into a list , eg

    var list2 =
    (from foo in fooStuff.AsEnumerable()
    select new { 
        FooID = foo.Field<Int64>("FooID") 
        Tier  = foo.Field<int>("Tier")
    }).ToList();
    

    Then you can query this list

    var query =
    from foo in list2
    group foo by foo.FooID into g
    select new
    {
            FooID = g.Key,
            FooTier = g.Min(foo => foo.Tier)
    };
    var results = query.ToList();
    

If this query is slow, it would suggest that there is something wrong with mono's implementation of GroupBy. You might be able to verify that by using something like this

    public static Dictionary<TKey, List<TSrc>> TestGroupBy<TSrc, TKey>
     (this IEnumerable<TSrc> src, Func<TSrc,TKey> groupFunc)
    {
        var dict= new Dictionary<TKey, List<TSrc>>();

        foreach (TSrc s in src)
        {
            TKey key = groupFunc(s);
            List<TSrc> list ;

            if (!dict.TryGetValue(key, out list))
            {
                list = new List<TSrc>();
                dict.Add(key, list);
            }       
            list.Add(s);        
            }

        return dict;
}

And to use it

  var results = list2.TestGroupBy(r=>r.FooID)
      .Select(r=>  new { FooID = r.Key, FooTier = r.Value.Min(r1=>r1.Tier)} );

Note, this is not meant as a replacement for groupby and does not cope with null keys but should be enough to determine if their is a problem with groupby (assuming mono's implementation of Dictionary and List are ok).

like image 22
sgmoore Avatar answered Nov 08 '22 08:11

sgmoore