Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rules for LINQ to SQL across method boundaries

Tags:

c#

linq-to-sql

To keep my code cleaner I often try to break down parts of my data access code in LINQ to SQL into private sub-methods, just like with plain-old business logic code. Let me give a very simplistic example:

public IEnumerable<Item> GetItemsFromRepository()
{
    var setA = from a in this.dataContext.TableA
               where /* criteria */
               select a.Prop;

    return DoSubquery(setA);
}

private IEnumerable<Item> DoSubQuery(IEnumerable<DateTimeOffset> set)
{
     return from item in set
            where /* criteria */
            select item;
}

I'm sure no one's imagination would be stretched by imagining more complex examples with deeper nesting or using results of sets to filter other queries.

My basic question is this: I've seen some significant performance differences and even exceptions being thrown by just simply reorganizing LINQ to SQL code in private methods. Can anyone explain the rules for these behaviors so that I can make informed decisions about how to write efficient, clean data access code?

Some questions I've had:

1) When does passage of System.Linq.Table instace to a method cause query execution?

2) When does using a System.Linq.Table in another query cause execution?

3) Are there limits to what types of operations (Take, First, Last, order by, etc.) can be applied to System.Linq.Table passed a parameters into a method?

like image 426
Brian Triplett Avatar asked Aug 13 '12 17:08

Brian Triplett


2 Answers

The most important rule in terms of LINQ-to-SQL would be: don't return IEnumerable<T> unless you must - as the semantic is unclear. There are two schools of thought beyond that:

  • if you return IQueryable<T>, it is composable, meaning the where from later queries is combined to make a single TSQL, but as a down-side, it is hard to fully test
  • otherwise, return List<T> or similar, so it is clear that everything beyond that point is LINQ-to-Objects

Currently, you are doing something in the middle: collapsing it to LINQ-to-Objects (via IEnumerable<T>), but without it being obvious - and keeping the connection open in the middle (again, only a problem because it isn't obvious)

like image 143
Marc Gravell Avatar answered Nov 03 '22 00:11

Marc Gravell


Remove the implicit cast:

public IQueryable<Item> GetItemsFromRepository()
{
    var setA = from a in this.dataContext.TableA
               where /* criteria */
               select a.Prop;

    return DoSubquery(setA);
}

private IQueryable<Item> DoSubQuery(IQueryable<DateTimeOffset> set)
{
     return from item in set
            where /* criteria */
            select item;
}

The implicit cast from IQueryable<Item> to IEnumerable<Item> is essentially the same as calling AsEnumerable() on your IQueryable<Item>. There are of course times when you want that, but you should leave things as IQueryable by default, so that the entire query can be performed on the database, rather than merely the GetItemsFromRepository() bit with the rest being done in memory.

The secondary questions:

1) When does passage of System.Linq.Table instace to a method cause query execution?

When something needs a final result, such as Max(), ToList(), etc. that is neither a queryable object, nor a loaded-as-it-goes enumerable.

Note though, that while AsEnumerable() does not cause query execution, it does mean that when execution does happen only that before the AsEnumerable() will be performed against the source datasource, this will then produce an on-demand in-memory datasource against which the rest will be performed.

2) When does using a System.Linq.Table in another query cause execution?

The same as above. Table<T> implements IQueryable<T>. If you e.g. join two of them together, that won't yet cause anything to be executed.

3) Are there limits to what types of operations (Take, First, Last, order by, etc.) can be applied to System.Linq.Table passed a parameters into a method?

Those that are definted by IQueryable<T>.

Edit: Some clarification on the differences and similarities between IEnumerable and IQueryable.

Just about anything you can do on an IQueryable you can do on an IEnumerable and vice-versa, but how it's performed will be different.

Any given IQueryable implementation can be used in linq queries and will have all the linqy extension methods like Take(), Select(), GroupBy and so on.

Just how this is done, depends on the implementation. For example, System.Linq.Data.Table implements those methods by the query being turned into an SQL query, the results of which are turned into a objects on a as-loaded basis. So if mySource is a table then:

var filtered = from item in mySource
  where item.ID < 23
  select new{item.ID, item.Name};

foreach(var i in filtered)
  Console.WriteLine(i.Name);

Gets turned into SQL like:

select id, name from mySourceTable where id < 23

And then an enumerator is created from that such that on each call to MoveNext() another row is read from the results, and a new anonymous object created from it.

On the other hand, if mySource where a List or a HashSet, or anything else that implements IEnumerable<T> but doesn't have its own query engine, then the linq-to-objects code will turn it into something like:

foreach(var item in mySource)
  if(item.ID < 23)
    yield return new {item.ID, item.Name};

Which is about as efficiently as that code could be done in memory. The results will be the same, but the way to get them, would be different:

Now, since all IQueryable<T> can be converted into the equivalent IEnumerable<T> we could, if we wanted to, take the first mySource (where execution happens in a database) and do the following instead:

var filtered = from item in mySource.AsEnumerable()
  where item.ID < 23
  select new{item.ID, item.Name};

Here, while there is still nothing executed against the database until we iterate through the results or call something that examines all of those results, once we do so, it's as if we split the execution into two separate steps:

var asEnum = mySource.AsEnumerable();
var filtered = from item in asEnum
  where item.ID < 23
  select new{item.ID, item.Name};

The implemenatation of the first line would be to execute the SQL SELECT * FROM mySourceTable, and the execution of the rest would be like the linq-to-objects example above.

It's not hard to see how, if the database contained 10 items with an id < 23, and 50,000 items with an id higher, this is now much, much less performant.

As well as offering the explicity AsEnumerable() method, all IQueryable<T> can be implicitly cast to IEnumerable<T>. This lets us do foreach on them and use them with any other existing code that handles IEnumerable<T>, but if we accidentally do it at in inappropriate time, we can make queries much slower, and this is what was happening when your DoSubQuery was defined to take an IEnumerable<DateTimeOffset> and return an IEnumerable<Item>; it implicitly called AsEnumerable() on your IQueryable<DateTimeOffset> and your IQueryable<Item> and caused what could have been performed on the database to be performed in-memory.

For this reason, 99% of the time, we want to keep dealing in IQueryable until the very last moment.

As an example of the opposite though, just to point out that AsEnumerable() and the casts to IEnumerable<T> aren't there out of madness, we should consider two things. The first is that IEnumerable<T> lets us do things that can't be done otherwise, such as joining two completely different sources that don't know about each other (e.g. two different databases, a database and an XML file, etc.)

Another is that sometimes IEnumerable<T> is actually more efficient too. Consider:

IQueryable<IGrouping<string, int>> groupingQuery = from item in mySource select item.ID group by item.Name;
var list1 = groupingQuery.Select(grp => new {Name=grp.Key, Count=grp.Count()}).ToList();//fine
foreach(var grp in groupingQuery)//disaster!
  Console.WriteLine(grp.Count());

Here groupingQuery is set up as a queryable that does some grouping, but which hasn't been executed in anyway. When we create list1, then first we create a new IQueryable based on that, and the query engine does it's best to work out what the best SQL for it is, and comes up with something like:

select name, count(id) from mySourceTable group by name

Which is pretty efficiently performed. Then the rows are turned into objects, which are then put into a list.

On the other hand, with the second query, there isn't as natural a SQL conversion for a group by that doesn't perform aggregate methods on all of the non-grouped items, so the best the query engine can come up with is to first do:

select distinct name from mySourceTable,

And then for every name it receives, to do:

select id from mySourceTable where name = '{name found in last query goes here}'

And so on, should this mean 2 SQL queries, or 200,000.

In this case, we're much better working on mySource.AsEnumerable() because here it is more efficient to grab the whole table into memory first. (Even better still would be to work on mySource.Select(item => new {item.ID, item.Name}).AsEnumerable() because then we still only retrieve the columns we care about from the database, and switch to in-memory at that point).

The last bit is worth remembering because it breaks our rule that we should stay with IQueryable<T> as long as possible. It isn't something to worry about much, but it is worth keeping an eye on if you do grouping and find yourself with a very slow query.

like image 45
Jon Hanna Avatar answered Nov 03 '22 01:11

Jon Hanna