Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Entity Framework + LINQ + "Contains" == Super Slow?

Trying to refactor some code that has gotten really slow recently and I came across a code block that is taking 5+ seconds to execute.

The code consists of 2 statements:

IEnumerable<int> StudentIds = _entities.Filters
                    .Where(x => x.TeacherId == Profile.TeacherId.Value && x.StudentId != null)
                    .Select(x => x.StudentId)
                    .Distinct<int>();

and

_entities.StudentClassrooms
                    .Include("ClassroomTerm.Classroom.School.District")
                    .Include("ClassroomTerm.Teacher.Profile")
                    .Include("Student")
                    .Where(x => StudentIds.Contains(x.StudentId)
                    && x.ClassroomTerm.IsActive
                    && x.ClassroomTerm.Classroom.IsActive
                    && x.ClassroomTerm.Classroom.School.IsActive
                    && x.ClassroomTerm.Classroom.School.District.IsActive).AsQueryable<StudentClassroom>();

So it's a bit messy but first I get a Distinct list of Id's from one Table (Filters), then I query another Table using it.

These are relatively small tables, but it's still 5+ seconds of query time.

I put this in LINQPad and it showed that it was doing the bottom query first then running 1000 "distinct" queries afterwards.

On a whim I changed the "StudentIds" code by just adding .ToArray() at the end. This improved the speed 1000x ... it now takes like 100ms to complete the same query.

What's the deal? What am I doing wrong?

like image 972
Alexandru Petrescu Avatar asked Apr 09 '12 22:04

Alexandru Petrescu


2 Answers

This is one of the pitfalls of deferred execution in Linq: In your first approach StudentIds is really an IQueryable, not an in-memory collection. That means using it in the second query will run the query again on the database - each and every time.

Forcing execution of the first query by using ToArray() makes StudentIds an in-memory collection and the Contains part in your second query will run over this collection that contains a fixed sequence of items - This gets mapped to something equivalent to a SQL where StudentId in (1,2,3,4) query.

This query will of course, be much much faster since you determined this sequence once up-front, and not every time the Where clause is executed. Your second query without using ToArray() (I would think) would be mapped to a SQL query with an where exists (...) sub-query that gets evaluated for each row.

like image 187
BrokenGlass Avatar answered Sep 18 '22 12:09

BrokenGlass


ToArray() Materializes the initial query to the server memory.

My guess would be the query provider is not able to parse the expression StudentIds.Contains(x.StudentId). Hence it probably thinks that the studentIds is an array already loaded to memory. So it's probably querying the database over and over again during the parsing phase. The only way to know for sure is to setup the profiler.

If you need to do this on the db server, use a join, instead of "contains". If you need to use contains to do what looks like a join problem, you are likely to be missing a surrogate primary key or a foreign key somewhere.

You could also declare studentIds as IQueryable instead of IEnumerable. This might give the query provider the hint it needs to interpret the studentIds as expression aka. data not already loaded to memory. I somehow doubt this but worth a try.

If all else fails, use ToArray(). This will load the initial studentIds to memory.

like image 36
Alwyn Avatar answered Sep 19 '22 12:09

Alwyn