Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does LINQ GroupBy produce different results when preceded by ToArray()?

I'm pulling some data from SQL using Entity Framework.

I've written some code that looks like the snippet below. Note that Something is a varchar value from the database. Furthermore, I think it may be relevant that every value in Something consists of ten digits, a hyphen, and then two more digits, such as "0123456789-01". (Don't remind me this is bad design; I'm not the architect of this database.)

var X = Entities.MyView
    .Select(x => x.Something)
    .Distinct();
// 5850 elements

var Y = Entities.MyView
    .GroupBy(x => x.Something);
// 5850 elements

var Z = Entities.MyView
    .ToArray()
    .GroupBy(x => x.Something);
// 5727 elements

// Added in response to user995219's Answer
var ZZ = Entities.MyView
    .GroupBy(x => x.Something)
    .ToArray();
    // 5850 elements

The first statement pulls down the distinct values of the Something column from the view MyView. It gets me the result 5850 elements, which is what I expect.

The second statement pulls down the whole view and groups it by unique values of Something, yielding 5850 groupings, which is what I expect.

The third statement is exactly the same, except it calls ToArray on the values from MyView. Normally, I would do this when I expect the calling code to use the entirety of the data and don't want to deal with any Lazy Loading SNAFUs. However, when I run this code in the debugger, this third statement yields 5727 groupings, 123 less than I expect.

Edit:The fourth statement simply reverses the order of the GroupBy and ToArray method chaining. Making this small change, I get the number of rows that I expect, but the real problem that this question cannot be solved this way because the real problem has code that associates data from this SQL call with other data and returns them in a single object.

I wish to understand how it is that using ToArray causes me to get the wrong number of results.

Follow-up: In response to user995219's answer, I rewrote MyView so that it has an additional column: SomethingInt, which simply containins a 32-bit integer formed by omitting the hyphen from Something and considering the result as a single integer. Then, the LINQ code does the GroupBy on the newly minted integer.

However, even with this change, I'm still having the same problem.

var A = Entities.MyView
    .ToArray();
// Returns 17893 elements, the same as if I ran SELECT * FROM MyView

var array0 = A.Select(x => x.SomethingInt).Distinct();
// Returns 5727 elements when I expect 5850

In SQL, SELECT COUNT(DISTINCT(SomethingInt)) FROM MyView; returns 5850, as I expect.

This should take any kind of string-comparison in the .net framework out of the picture, but the problem persists.

like image 299
Vivian River Avatar asked Feb 07 '14 01:02

Vivian River


2 Answers

The ToArray statement executes the query and the group by is performed in memory using string equality comparer. The first two statements perform the group by at SQL level. There are differences: String Comparison differences between .NET and T-SQL?

Particularly concerning collation.

like image 105
user995219 Avatar answered Sep 19 '22 01:09

user995219


I found the answer. user995219's answer was useful, but not the full explanation.

Apparently, LINQ methods check the contents of what they are operating on. In my case, I'm using classes generated by Entity Framework. These have "Entity Keys", which allow the .net framework to distinguish between two rows that have the same contents and two instances of the same row.

In my case, I was using a complicated view and the .net framework inferred the entity keys incorrectly and then discarded rows because it thought they were the same.

The solution for me was to modify my view so that there is a GUID that uniquely identifies each row and use the GUIDs as an entity key.

like image 32
Vivian River Avatar answered Sep 20 '22 01:09

Vivian River