
Optimize entity framework query

I'm trying to make a Stack Overflow clone in my own time to learn EF6 and MVC5; I'm currently using OWIN for authentication.

Everything works fine when I have around 50-60 questions. I used the Red Gate data generator and ramped it up to 1 million questions with a few thousand child-table rows (without relationships) just to 'stress' the ORM a bit. Here's what the LINQ looks like:

var query = ctx.Questions
               .AsNoTracking()     //read-only performance boost.. http://visualstudiomagazine.com/articles/2010/06/24/five-tips-linq-to-sql.aspx
               .Include("Attachments")                                
               .Include("Location")
               .Include("CreatedBy") //IdentityUser
               .Include("Tags")
               .Include("Upvotes")
               .Include("Upvotes.CreatedBy")
               .Include("Downvotes")
               .Include("Downvotes.CreatedBy")
               .AsQueryable();

if (string.IsNullOrEmpty(sort)) //default
{
    query = query.OrderByDescending(x => x.CreatedDate);
}
else
{
    sort = sort.ToLower();
    if (sort == "latest")
    {
        query = query.OrderByDescending(x => x.CreatedDate);
    }
    else if (sort == "popular")
    {
        //most viewed
        query = query.OrderByDescending(x => x.ViewCount);
    }
}

var complaints = query.Skip(skipCount)
                      .Take(pageSize)
                      .ToList(); //makes an evaluation..

Needless to say, I'm getting SQL timeouts, and after installing MiniProfiler and looking at the generated SQL statement, it's a monstrous few hundred lines long.

I know I'm joining/including too many tables, but in how many real-life projects do we only have to join 1 or 2 tables? There may be situations where we have to do this many joins against multi-million-row tables; are stored procedures the only way to go?

If that's the case, would EF itself only be suitable for small-scale projects?

asked Mar 04 '14 by Lee Gary

4 Answers

Most likely the problem you are experiencing is a Cartesian product.

Based on just some sample data:

var query = ctx.Questions // 50 
  .Include("Attachments") // 20                                
  .Include("Location") // 10
  .Include("CreatedBy") // 5
  .Include("Tags") // 5
  .Include("Upvotes") // 5
  .Include("Upvotes.CreatedBy") // 5
  .Include("Downvotes") // 5
  .Include("Downvotes.CreatedBy") // 5

  // Where Blah
  // Order By Blah

This returns a number of rows upwards of

50 x 20 x 10 x 5 x 5 x 5 x 5 x 5 x 5 = 156,250,000

Seriously... that is an INSANE number of rows to return.

You really have two options if you are having this issue:

First: the easy way. Rely on Entity Framework to wire up models automagically as they enter the context; afterwards, use the entities (AsNoTracking()) and dispose of the context.

// Continuing with the query above:

// Each statement issues its own SQL query; the context then fixes up the relationships.
var questions = query.Select(q => q).ToList();
var attachments = query.Select(q => q.Attachments).ToList();
var locations = query.Select(q => q.Location).ToList();

This will make one request per table, but instead of 156 MILLION rows you only download 110 rows. The cool part is that they are all wired up in the EF context cache, so the questions variable is now completely populated.

Second: Create a stored procedure that returns multiple tables and have EF materialize the classes.
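
For the stored-procedure route, EF6 can materialize several result sets from a single command via ObjectContext.Translate. Here is a minimal sketch, assuming a hypothetical dbo.GetQuestionPage procedure that returns a page of questions followed by their attachments, and entity set names that match the DbSet names:

// using System.Data; using System.Data.Entity.Core.Objects;
// using System.Data.Entity.Infrastructure; using System.Linq;
using (var ctx = new AppDbContext()) // AppDbContext stands in for your actual context type
{
    var cmd = ctx.Database.Connection.CreateCommand();
    cmd.CommandText = "[dbo].[GetQuestionPage]"; // hypothetical stored procedure
    cmd.CommandType = CommandType.StoredProcedure;

    ctx.Database.Connection.Open();
    using (var reader = cmd.ExecuteReader())
    {
        var objectContext = ((IObjectContextAdapter)ctx).ObjectContext;

        // First result set: the questions (tracked, so EF can fix up relationships).
        var questions = objectContext
            .Translate<Question>(reader, "Questions", MergeOption.AppendOnly)
            .ToList();

        // Second result set: the attachments; EF wires them onto the questions above.
        reader.NextResult();
        objectContext
            .Translate<Attachment>(reader, "Attachments", MergeOption.AppendOnly)
            .ToList();
    }
}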

New third: EF Core now supports split queries, which do the above while keeping the nice .Include() methods. Split queries do have a few gotchas, so I recommend reading all the documentation.

Example from the above link:

If a typical blog has multiple related posts, rows for these posts will duplicate the blog's information. This duplication leads to the so-called "cartesian explosion" problem.

using (var context = new BloggingContext())
{
    var blogs = context.Blogs
        .Include(blog => blog.Posts)
        .AsSplitQuery()
        .ToList();
}

It will produce the following SQL:

SELECT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]
FROM [Blogs] AS [b]
ORDER BY [b].[BlogId]

SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title], [b].[BlogId]
FROM [Blogs] AS [b]
INNER JOIN [Post] AS [p] ON [b].[BlogId] = [p].[BlogId]
ORDER BY [b].[BlogId]
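
For what it's worth, split-query behavior can also be turned on globally rather than per query. A small sketch, assuming EF Core 5+ with the SQL Server provider (the connection string is a placeholder):

// using Microsoft.EntityFrameworkCore;
protected override void OnConfiguring(DbContextOptionsBuilder options)
{
    options.UseSqlServer(
        "YourConnectionString", // placeholder
        sql => sql.UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery));
}
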
answered Oct 19 '22 by Erik Philips

I don't see anything obviously wrong with your LINQ query (.AsQueryable() isn't necessary, but removing it won't change anything). Of course, don't include unnecessary navigation properties (each one adds a SQL JOIN), but if everything is required, it should be OK.

Now that the C# code looks OK, it's time to look at the generated SQL. As you already did, the first step is to retrieve the SQL query that is actually executed. There are .NET ways of doing it; for SQL Server, I personally always start a SQL Server profiling session.
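
For reference, two quick ways to grab the generated SQL from code in EF6 (a sketch; ctx and query are the context and query from the question):

// Log every command EF sends to the database (EF6's built-in logging hook).
ctx.Database.Log = s => System.Diagnostics.Debug.Write(s);

// For most LINQ-to-Entities queries, ToString() on the query returns the generated SELECT.
var generatedSql = query.ToString();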

Once you have the SQL query, try to execute it directly against your database, and don't forget to include the actual execution plan. This will show you exactly which part of your query takes the majority of the time. It will even indicate whether there are obvious missing indexes.

Now the question is: should you add all the indexes SQL Server tells you are missing? Not necessarily. See for example Don't just blindly create those missing indexes. You'll have to choose which indexes should be added and which shouldn't.

As the code-first approach created the indexes for you, I'm assuming those are indexes on the primary and foreign keys only. That's a good start, but it's not enough. I don't know the number of rows in your tables, but an obvious index that only you can add (no code-generation tool can do that, because it's related to your business queries) is, for example, an index on the CreatedDate column, since you're ordering your items by this value. Without it, SQL Server will have to execute a table scan on 1M rows, which will of course be disastrous in terms of performance.
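
As a sketch of what that could look like with an EF6 code-first migration (table, column, and index names are assumptions based on the question):

// using System.Data.Entity.Migrations;
public partial class AddQuestionSortIndexes : DbMigration
{
    public override void Up()
    {
        // Supports the default "latest" sort (ORDER BY CreatedDate DESC with paging).
        CreateIndex("dbo.Questions", "CreatedDate", name: "IX_Questions_CreatedDate");

        // Supports the "popular" sort (ORDER BY ViewCount DESC).
        CreateIndex("dbo.Questions", "ViewCount", name: "IX_Questions_ViewCount");
    }

    public override void Down()
    {
        DropIndex("dbo.Questions", "IX_Questions_ViewCount");
        DropIndex("dbo.Questions", "IX_Questions_CreatedDate");
    }
}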

So:

  • try to remove some Include calls if you can
  • look at the actual execution plan to see where the performance issue is in your query
  • add only the missing indexes that make sense, depending on how you're ordering/filtering the data you're getting from the DB
answered Oct 19 '22 by ken2k


As you already know, the Include method generates monstrous SQL.

Disclaimer: I'm the owner of the project Entity Framework Plus (EF+)

The EF+ Query IncludeOptimized method lets you optimize the generated SQL exactly the way EF Core does.

Instead of one monstrous SQL statement, multiple SQL statements are generated (one for each include). As a bonus, this feature also allows filtering related entities.

Docs: EF+ Query IncludeOptimized

var query = ctx.Questions
               .AsNoTracking()
               .IncludeOptimized(x => x.Attachments)                                
               .IncludeOptimized(x => x.Location)
               .IncludeOptimized(x => x.CreatedBy) //IdentityUser
               .IncludeOptimized(x => x.Tags)
               .IncludeOptimized(x => x.Upvotes)
               .IncludeOptimized(x => x.Upvotes.Select(y => y.CreatedBy))
               .IncludeOptimized(x => x.Downvotes)
               .IncludeOptimized(x => x.Downvotes.Select(y => y.CreatedBy))
               .AsQueryable();
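
The related-entity filtering mentioned above is done by placing the filter inside the selector, if I'm reading the EF+ documentation correctly (a sketch; the IsDeleted flag is hypothetical):

var attachmentsQuery = ctx.Questions
                          .AsNoTracking()
                          .IncludeOptimized(x => x.Attachments.Where(a => !a.IsDeleted));
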
answered Oct 19 '22 by Jonathan Magnan


Take a look at section 8.2.2 of this document from Microsoft:

8.2.2 Performance concerns with multiple Includes

When we hear performance questions that involve server response time problems, the source of the issue is frequently queries with multiple Include statements. While including related entities in a query is powerful, it's important to understand what's happening under the covers.

It takes a relatively long time for a query with multiple Include statements in it to go through our internal plan compiler to produce the store command. The majority of this time is spent trying to optimize the resulting query. The generated store command will contain an Outer Join or Union for each Include, depending on your mapping. Queries like this will bring in large connected graphs from your database in a single payload, which will exacerbate any bandwidth issues, especially when there is a lot of redundancy in the payload (i.e. with multiple levels of Include to traverse associations in the one-to-many direction).

You can check for cases where your queries are returning excessively large payloads by accessing the underlying TSQL for the query by using ToTraceString and executing the store command in SQL Server Management Studio to see the payload size. In such cases you can try to reduce the number of Include statements in your query to just bring in the data you need. Or you may be able to break your query into a smaller sequence of subqueries, for example:

Before breaking the query:

using (NorthwindEntities context = new NorthwindEntities())
{
    var customers = from c in context.Customers.Include(c => c.Orders)
                    where c.LastName.StartsWith(lastNameParameter)
                    select c;

    foreach (Customer customer in customers)
    {
        ...
    }
}

After breaking the query:

using (NorthwindEntities context = new NorthwindEntities())
{
    var orders = from o in context.Orders
                 where o.Customer.LastName.StartsWith(lastNameParameter)
                 select o;

    orders.Load();

    var customers = from c in context.Customers
                    where c.LastName.StartsWith(lastNameParameter)
                    select c;

    foreach (Customer customer in customers)
    {
        ...
    }
}

This will work only on tracked queries, as we are making use of the ability the context has to perform identity resolution and association fixup automatically.

As with lazy loading, the tradeoff will be more queries for smaller payloads. You can also use projections of individual properties to explicitly select only the data you need from each entity, but you will not be loading entities in this case, and updates will not be supported.
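
Applied to the question's query, the projection approach could look something like this sketch (property names are taken from the query above where possible; Id, Title, and Tag.Name are assumptions):

var page = ctx.Questions
              .AsNoTracking()
              .OrderByDescending(q => q.CreatedDate)
              .Skip(skipCount)
              .Take(pageSize)
              .Select(q => new
              {
                  q.Id,            // assumed key property
                  q.Title,         // assumed display property
                  q.CreatedDate,
                  q.ViewCount,
                  CreatedBy = q.CreatedBy.UserName,
                  Tags = q.Tags.Select(t => t.Name),   // Tag.Name assumed
                  UpvoteCount = q.Upvotes.Count(),
                  DownvoteCount = q.Downvotes.Count()
              })
              .ToList();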

answered Oct 19 '22 by adam0101