Top per group: Take(1) works but FirstOrDefault() doesn't?

I'm using EF 4.3.1... just upgraded to 4.4 (problem remains) with database-first POCO entities generated by the EF 4.x DbContext Generator. I have the following database named 'Wiki' (SQL script to create tables and data is here):

Author(ID, Name) <-- Article(AuthorID, Title, Revision, CreatedUTC, Body)

When a wiki article is edited, instead of its record being updated, the new revision is inserted as a new record with the revision counter incremented. In my database there is one author, "John Doe", who has two articles, "Article A" and "Article B". Article A has two revisions (1 and 2), while Article B has only one.

[image]

I have both lazy loading and proxy creation disabled (here is the sample solution I'm using with LINQPad). I want to get the latest revisions of articles created by people whose name starts with "John", so I do the following query:

Authors.Where(au => au.Name.StartsWith("John"))
       .Select(au => au.Articles.GroupBy(ar => ar.Title)
                                .Select(g => g.OrderByDescending(ar => ar.Revision)
                                              .FirstOrDefault()))

This produces the wrong result, and retrieves only the first article:

[image: query result]

Making a small change in the query, by replacing .FirstOrDefault() with .Take(1) results in the following query:

Authors.Where(au => au.Name.StartsWith("John"))
       .Select(au => au.Articles.GroupBy(ar => ar.Title)
                                .Select(g => g.OrderByDescending(ar => ar.Revision)
                                              .Take(1)))

Surprisingly, this query produces correct results (albeit with more nesting):

[image: query result]
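For reference, here is a minimal sketch of the same two projections run over an in-memory copy of the sample data (LINQ to Objects, no EF involved). The anonymous objects are an assumption standing in for the generated Author/Article POCO entities, and only the Name, Title and Revision columns are modelled. It shows only what plain LINQ semantics would give for both forms; it says nothing about what EF does internally.

```csharp
using System;
using System.Linq;

// Assumption: anonymous objects stand in for the generated Author/Article
// POCO entities; only the columns the query touches are included.
var authors = new[]
{
    new
    {
        Name = "John Doe",
        Articles = new[]
        {
            new { Title = "Article A", Revision = 1 },
            new { Title = "Article A", Revision = 2 },
            new { Title = "Article B", Revision = 1 }
        }
    }
};

// Same shape as the first query: one article (or null) per title group.
var viaFirstOrDefault = authors
    .Where(au => au.Name.StartsWith("John"))
    .Select(au => au.Articles.GroupBy(ar => ar.Title)
                             .Select(g => g.OrderByDescending(ar => ar.Revision)
                                           .FirstOrDefault()));

// Same shape as the second query: one single-element sequence per title group.
var viaTake = authors
    .Where(au => au.Name.StartsWith("John"))
    .Select(au => au.Articles.GroupBy(ar => ar.Title)
                             .Select(g => g.OrderByDescending(ar => ar.Revision)
                                           .Take(1)));

// Flatten both results to "Title rev N" strings so they can be compared.
var latestViaFirstOrDefault = viaFirstOrDefault
    .SelectMany(groups => groups)
    .Select(ar => $"{ar.Title} rev {ar.Revision}")
    .ToList();
var latestViaTake = viaTake
    .SelectMany(groups => groups)
    .SelectMany(top => top)
    .Select(ar => $"{ar.Title} rev {ar.Revision}")
    .ToList();

Console.WriteLine(string.Join(", ", latestViaFirstOrDefault));
Console.WriteLine(string.Join(", ", latestViaTake));
```

In memory, both lists contain the latest revision of Article A and of Article B; the Take(1) form just needs one extra level of flattening.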

I assumed EF was generating slightly different SQL queries, one returning only the latest revision of a single article and the other returning the latest revision of all articles. The ugly SQL generated by the two queries differs only slightly (compare: SQL for .FirstOrDefault() vs SQL for .Take(1)), but both return the correct result:

.FirstOrDefault()

[image: SQL result set]

.Take(1) (column order rearranged for easy comparison)

[image: SQL result set]

The culprit therefore is not the generated SQL, but EF's interpretation of the result. Why does EF interpret the first result as a single Article instance while it interprets the second result as two Article instances? Why does the first query return incorrect results?

EDIT: I have opened a bug report on Connect. Please upvote it if you think it is important to fix this issue.

Allon Guralnek asked Aug 27 '12 08:08


1 Answer

Looking at:
http://msdn.microsoft.com/en-us/library/system.linq.enumerable.firstordefault
http://msdn.microsoft.com/en-us/library/bb503062.aspx
there's a very nice explanation of how Take works (lazy, early breaking) but none for FirstOrDefault. What's more, going by the explanation of Take, I'd 'guesstimate' that the queries with Take might cut the number of rows in an attempt to emulate lazy evaluation in SQL, and your case indicates it's the other way around! I do not understand why you are observing such an effect.

It's probably just implementation-specific. To me, both Take(1) and FirstOrDefault look like TOP 1; however, from a functional point of view there may be a slight difference in their 'laziness': one function may evaluate all elements and return the first, while the other may evaluate the first, return it, and stop evaluating. This is only a "hint" at what might have happened. To me it makes no sense, because I see no docs on this subject, and in general I'm sure that both Take and FirstOrDefault are lazy and should evaluate only the first N elements.

In the first part of your query, the group.Select + OrderBy + TOP 1 is a "clear indication" that you are interested in the single row with the highest 'value' in a column per group. In fact, though, there is no simple way to declare that in SQL, so the indication is not that clear at all for the SQL engine, nor for the EF engine.

To me, the behaviour you present could indicate that the FirstOrDefault was 'propagated' by the EF translator one layer of inner queries too far upwards, as if it applied to Articles.GroupBy() (are you sure you have not misplaced the parens after the OrderBy? :) ), and that would be a bug.

But -

As the difference must be somewhere in the meaning and/or order of execution, let's see what EF can guess about the meaning of your query. How does the Author entity get its Articles? How does EF know which Article to bind to your author? Via the nav property, of course. But how does it happen that only some of the articles are preloaded? It seems simple: the query returns results with some columns, the columns describe whole Authors and whole Articles, so let's map them to authors and articles and match them to each other via the nav keys. OK. But what if you add complex filtering to that?

With a simple filter, such as by date, there is a single subquery for all articles, rows are truncated by date, and all rows are consumed. But what about a complex query that uses several intermediate orderings and produces several subsets of articles? Which subset should be bound to the resulting Author? The union of all of them? That would nullify all top-level where-like clauses. The first of them? Nonsense, since the first subqueries tend to be intermediary helpers. So, probably, when a query is seen as a set of subqueries with similar structure that could all serve as the data source for partially loading a nav property, most likely only the last subquery is taken as the actual result. This is all abstract thinking, but it made me notice that Take() versus FirstOrDefault(), with their overall JOIN versus LEFT JOIN meaning, could in fact change the order in which the result set is scanned. Perhaps Take() was somehow optimized and done in one scan over the whole result, thus visiting all of the author's articles at once, while FirstOrDefault() was executed as a direct scan (for each author, for each title group: select the top one, check the count, and substitute null if empty). That would have produced many small one-item collections of articles per author, and thus ended up with one result, coming only from the last title group visited.

This is the only explanation I can think of, apart from the obvious "BUG!" shout. As a LINQ user, to me it still is a bug. Either such an optimization should not have taken place at all, or it should cover FirstOrDefault too, since it is the same as Take(1).DefaultIfEmpty(). Heh, by the way, have you tried that? As I said, Take(1) is not the same as FirstOrDefault because of the JOIN/LEFT JOIN meaning, but Take(1).DefaultIfEmpty() is actually semantically the same. It could be fun to see what SQL it produces and what results come out of the EF layers.
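To make the Take(1) / FirstOrDefault / Take(1).DefaultIfEmpty() distinction concrete, here is a small LINQ to Objects sketch. It only illustrates the in-memory semantics of the three operators on an empty and a non-empty source; the arrays `some` and `none` are made-up sample data, and nothing here shows how EF translates these calls to SQL.

```csharp
using System;
using System.Linq;

// Hypothetical sample data: a non-empty and an empty source.
string[] some = { "rev 1", "rev 2" };
string[] none = { };

// FirstOrDefault() returns a scalar: the first element, or null when the source is empty.
var firstOfSome = some.OrderByDescending(s => s).FirstOrDefault();
var firstOfNone = none.OrderByDescending(s => s).FirstOrDefault();

// Take(1) returns a sequence, which is simply empty when the source is empty.
var takeOfNone = none.OrderByDescending(s => s).Take(1).ToList();

// Take(1).DefaultIfEmpty() returns a sequence that always holds exactly one element.
var paddedSome = some.OrderByDescending(s => s).Take(1).DefaultIfEmpty().ToList();
var paddedNone = none.OrderByDescending(s => s).Take(1).DefaultIfEmpty().ToList();

Console.WriteLine(firstOfSome);                 // rev 2
Console.WriteLine(firstOfNone == null);         // True
Console.WriteLine(takeOfNone.Count);            // 0
Console.WriteLine(paddedSome.Single());         // rev 2
Console.WriteLine(paddedNone.Single() == null); // True
```

The empty case is where the forms differ: Take(1) drops the row (JOIN-like), while FirstOrDefault and Take(1).DefaultIfEmpty() both keep a null placeholder (LEFT JOIN-like).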

I have to admit that the selection of related entities in partial loading was never clear to me, and I have not actually used partial loading for a long time, as I always state my queries so that the results and groupings are explicitly defined (*). Hence, I could simply have forgotten some key aspect/rule/definition of its inner workings; maybe, for example, it actually selects every related record from the result set (not just the last subcollection, as I described above). If I have forgotten something, everything I just described would obviously be wrong.

(*) In your case, I'd make Article.AuthorID a nav property too (public Author Author { get; set; }), and then rewrite the query to be more flat/pipelined, like:

var aths = db.Articles
              .GroupBy(ar => new { ar.Author, ar.Title })
              .Take(10)
              .Select(grp => new { grp.Key.Author, Arts = grp.OrderByDescending(ar => ar.Revision).Take(1) });

and then fill the View with pairs of Author and Arts separately, instead of trying to partially fill the author and use the author only. By the way, I've not tested it against EF and SQL Server; it is just an example of 'flipping the query upside down' and 'flattening' the subqueries in the case of JOINs. It is unusable for LEFT JOINs, so if you'd also like to see the authors without articles, the query has to start from Authors, like your original query.
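As a rough illustration of the shape this flattened query produces, here is the same pipeline over in-memory data (LINQ to Objects). This is only a sketch under assumptions: the anonymous objects replace the Article entities, and a plain author name string stands in for the Author nav property. It has not been run against EF either.

```csharp
using System;
using System.Linq;

// Assumption: anonymous objects replace the Article entities, and the
// author's name (a string) stands in for the Author nav property.
var articles = new[]
{
    new { Author = "John Doe", Title = "Article A", Revision = 1 },
    new { Author = "John Doe", Title = "Article A", Revision = 2 },
    new { Author = "John Doe", Title = "Article B", Revision = 1 }
};

// One group per (author, title) pair; each group keeps only its latest revision.
var pairs = articles
    .GroupBy(ar => new { ar.Author, ar.Title })
    .Take(10)
    .Select(grp => new
    {
        grp.Key.Author,
        Arts = grp.OrderByDescending(ar => ar.Revision).Take(1).ToList()
    })
    .ToList();

// Each result is an (Author, Arts) pair that can be shown separately.
foreach (var pair in pairs)
    foreach (var art in pair.Arts)
        Console.WriteLine($"{pair.Author}: {art.Title} rev {art.Revision}");
```

With the sample data this yields one pair per title, each carrying a one-element Arts list with the latest revision.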

I hope these loose thoughts will help a bit in finding the 'why'.

quetzalcoatl answered Oct 20 '22 01:10