 

Should the order of LINQ query clauses affect Entity Framework performance?


I'm using Entity Framework (code first) and finding the order I specify clauses in my LINQ queries is having a huge performance impact, so for example:

using (var db = new MyDbContext())
{
    var mySize = "medium";
    var myColour = "vermilion";
    var list1 = db.Widgets.Where(x => x.Colour == myColour && x.Size == mySize).ToList();
    var list2 = db.Widgets.Where(x => x.Size == mySize && x.Colour == myColour).ToList();
}

When the (rare) colour clause precedes the (common) size clause it's fast, but the other way round it's orders of magnitude slower. The table has a couple of million rows and the two fields in question are nvarchar(50) (so not normalised), but each is indexed. The fields are specified in code-first fashion as follows:

    [StringLength(50)]
    public string Colour { get; set; }

    [StringLength(50)]
    public string Size { get; set; }

Am I really supposed to have to worry about such things in my LINQ queries? I thought that was the database's job.

System specs are:

  • Visual Studio 2010
  • .NET 4
  • EntityFramework 6.0.0-beta1
  • SQL Server 2008 R2 Web (64 bit)

Update:

Right, for any gluttons for punishment, the effect can be replicated as below. The issue seems to be tremendously sensitive to a number of factors, so please bear with the contrived nature of some of this:

Install EntityFramework 6.0.0-beta1 via NuGet, then generate the model code-first style with:

public class Widget
{
    [Key]
    public int WidgetId { get; set; }

    [StringLength(50)]
    public string Size { get; set; }

    [StringLength(50)]
    public string Colour { get; set; }
}

public class MyDbContext : DbContext
{
    public MyDbContext()
        : base("DefaultConnection")
    {
    }

    public DbSet<Widget> Widgets { get; set; }
}

Generate the dummy data with the following SQL:


insert into Widget (Size, Colour)
select RND1 + ' is the name is this size' as Size,
       RND2 + ' is the name of this colour' as Colour
from (
    select top 1000000
        CAST(abs(Checksum(NewId())) % 100 as varchar) as RND1,
        CAST(abs(Checksum(NewId())) % 10000 as varchar) as RND2
    from master..spt_values t1 cross join master..spt_values t2
) t3

Add one index each for Colour and Size, then query with:


string mySize = "99 is the name is this size";
string myColour = "9999 is the name of this colour";

using (var db = new MyDbContext())
{
    var list1 = db.Widgets.Where(x => x.Colour == myColour && x.Size == mySize).ToList();
}

using (var db = new MyDbContext())
{
    var list2 = db.Widgets.Where(x => x.Size == mySize && x.Colour == myColour).ToList();
}

The issue seems connected with the obtuse collection of NULL comparisons in the generated SQL, as below.

exec sp_executesql N'SELECT
    [Extent1].[WidgetId] AS [WidgetId],
    [Extent1].[Size] AS [Size],
    [Extent1].[Colour] AS [Colour]
FROM [dbo].[Widget] AS [Extent1]
WHERE ((([Extent1].[Size] = @p__linq__0) AND (NOT ([Extent1].[Size] IS NULL OR @p__linq__0 IS NULL)))
        OR (([Extent1].[Size] IS NULL) AND (@p__linq__0 IS NULL)))
  AND ((([Extent1].[Colour] = @p__linq__1) AND (NOT ([Extent1].[Colour] IS NULL OR @p__linq__1 IS NULL)))
        OR (([Extent1].[Colour] IS NULL) AND (@p__linq__1 IS NULL)))',
N'@p__linq__0 nvarchar(4000),@p__linq__1 nvarchar(4000)',
@p__linq__0=N'99 is the name is this size',
@p__linq__1=N'9999 is the name of this colour'
go

Changing the equality operator in the LINQ to StartsWith() makes the problem go away, as does changing either one of the two fields to be non-nullable at the database.
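For example, the StartsWith() variant looks roughly like this (a sketch only, reusing the mySize and myColour test values above; the non-nullable variant was done by altering the column at the database rather than in code):

using (var db = new MyDbContext())
{
    // Using StartsWith() instead of == avoids the NULL-compensation pattern
    // in the generated SQL, and the slowdown described above disappears.
    var list3 = db.Widgets
                  .Where(x => x.Size.StartsWith(mySize) && x.Colour.StartsWith(myColour))
                  .ToList();
}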

I despair!

Update 2:

Some assistance for any bounty hunters: the issue can be reproduced on SQL Server 2008 R2 Web (64 bit) in a clean database, as follows:

CREATE TABLE [dbo].[Widget](
    [WidgetId] [int] IDENTITY(1,1) NOT NULL,
    [Size] [nvarchar](50) NULL,
    [Colour] [nvarchar](50) NULL,
 CONSTRAINT [PK_dbo.Widget] PRIMARY KEY CLUSTERED
(
    [WidgetId] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX IX_Widget_Size ON dbo.Widget
(
    Size
) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX IX_Widget_Colour ON dbo.Widget
(
    Colour
) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

insert into Widget (Size, Colour)
select RND1 + ' is the name is this size' as Size,
       RND2 + ' is the name of this colour' as Colour
from (
    select top 1000000
        CAST(abs(Checksum(NewId())) % 100 as varchar) as RND1,
        CAST(abs(Checksum(NewId())) % 10000 as varchar) as RND2
    from master..spt_values t1 cross join master..spt_values t2
) t3
GO

and then compare the relative performance of the following two queries (you may need to adjust the parameter test values to get a query which returns a couple of rows in order to observe the effect, i.e. the second query is much slower).

exec sp_executesql N'SELECT
    [Extent1].[WidgetId] AS [WidgetId],
    [Extent1].[Size] AS [Size],
    [Extent1].[Colour] AS [Colour]
FROM [dbo].[Widget] AS [Extent1]
WHERE ((([Extent1].[Colour] = @p__linq__0) AND (NOT ([Extent1].[Colour] IS NULL OR @p__linq__0 IS NULL)))
        OR (([Extent1].[Colour] IS NULL) AND (@p__linq__0 IS NULL)))
  AND ((([Extent1].[Size] = @p__linq__1) AND (NOT ([Extent1].[Size] IS NULL OR @p__linq__1 IS NULL)))
        OR (([Extent1].[Size] IS NULL) AND (@p__linq__1 IS NULL)))',
N'@p__linq__0 nvarchar(4000),@p__linq__1 nvarchar(4000)',
@p__linq__0=N'9999 is the name of this colour',
@p__linq__1=N'99 is the name is this size'
go

exec sp_executesql N'SELECT
    [Extent1].[WidgetId] AS [WidgetId],
    [Extent1].[Size] AS [Size],
    [Extent1].[Colour] AS [Colour]
FROM [dbo].[Widget] AS [Extent1]
WHERE ((([Extent1].[Size] = @p__linq__0) AND (NOT ([Extent1].[Size] IS NULL OR @p__linq__0 IS NULL)))
        OR (([Extent1].[Size] IS NULL) AND (@p__linq__0 IS NULL)))
  AND ((([Extent1].[Colour] = @p__linq__1) AND (NOT ([Extent1].[Colour] IS NULL OR @p__linq__1 IS NULL)))
        OR (([Extent1].[Colour] IS NULL) AND (@p__linq__1 IS NULL)))',
N'@p__linq__0 nvarchar(4000),@p__linq__1 nvarchar(4000)',
@p__linq__0=N'99 is the name is this size',
@p__linq__1=N'9999 is the name of this colour'
go

You may also find, as I do, that if you rerun the dummy data insert so that there are now two million rows, the problem goes away.

Asked Jun 26 '13 by stovroz



2 Answers

The core of the question is not "why does the order matter with LINQ?". LINQ just translates literally without reordering. The real question is "why do the two SQL queries have different performance?".

I was able to reproduce the problem by inserting only 100k rows. In that case a weakness in the optimizer is triggered: because of the complex condition, it does not recognize that it can do a seek on Colour. In the first query the optimizer does recognize the pattern and creates an index seek.

There is no semantic reason why this should be. A seek on an index is possible even when seeking on NULL. This is a weakness/bug in the optimizer. Here are the two plans:

[Screenshots of the two execution plans]

EF tries to be helpful here because it assumes that both the column and the filter variable can be null. In that case it tries to give you a match (which according to C# semantics is the right thing).

I tried undoing that by adding the following filter:

Colour IS NOT NULL AND @p__linq__0 IS NOT NULL AND Size IS NOT NULL AND @p__linq__1 IS NOT NULL 

I hoped that the optimizer would use that knowledge to simplify the complex EF filter expression, but it did not manage to do so. If this had worked, the same filter could have been added to the EF query, providing an easy fix.
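For illustration, that extra filter would have looked roughly like this in the LINQ query (a sketch only, using the mySize/myColour test values from the question; as noted, the optimizer does not take advantage of it, so this is not an actual fix):

using (var db = new MyDbContext())
{
    // Explicit not-null checks added in the hope that the optimizer would
    // simplify away EF's NULL-compensation terms (it does not, as noted above).
    var list = db.Widgets
                 .Where(x => x.Colour != null && x.Size != null
                             && x.Colour == myColour && x.Size == mySize)
                 .ToList();
}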

Here are the fixes that I recommend, in the order you should try them:

  1. Make the columns not-null in the database
  2. Make the columns not-null in the EF data model, hoping that this will prevent EF from creating the complex filter condition
  3. Create composite indexes on (Colour, Size) and/or (Size, Colour); these also remove the problem (scripts below)
  4. Ensure that the filtering is done in the right order and leave a code comment
  5. Try to use INTERSECT/Queryable.Intersect to combine the filters. This often results in different plan shapes (see the sketch after this list)
  6. Create an inline table-valued function that does the filtering. EF can use such a function as part of a bigger query
  7. Drop down to raw SQL (also sketched below)
  8. Use a plan guide to change the plan
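
To illustrate options 5 and 7, a rough sketch against the model from the question (assumed, not tested against this exact schema; mySize/myColour are the test values from the question):

using (var db = new MyDbContext())
{
    // Option 5: combine the two filters with Intersect, which often
    // produces a different plan shape than one combined predicate.
    var byColour = db.Widgets.Where(x => x.Colour == myColour);
    var bySize   = db.Widgets.Where(x => x.Size == mySize);
    var list     = byColour.Intersect(bySize).ToList();

    // Option 7: drop down to raw SQL with plain parameters, bypassing
    // EF's generated NULL-compensation predicates entirely.
    var raw = db.Widgets.SqlQuery(
        "SELECT * FROM dbo.Widget WHERE Colour = @p0 AND Size = @p1",
        myColour, mySize).ToList();
}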

All of these are workarounds, not root cause fixes.

In the end I am not happy with either SQL Server or EF here. Both products should be fixed. Alas, they likely won't be, and you can't wait for that either.

Here are the index scripts:

CREATE NONCLUSTERED INDEX IX_Widget_Colour_Size ON dbo.Widget
(
    Colour, Size
) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

CREATE NONCLUSTERED INDEX IX_Widget_Size_Colour ON dbo.Widget
(
    Size, Colour
) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
Answered by usr


Note: I ran into this question long after others had already provided generally correct answers. I decided to post this as a separate answer only because I think the workaround can be helpful, and because you might appreciate having better insight into the reason EF behaves this way.

Short answer: The best workaround for this issue is to set this flag on your DbContext instance:

context.Configuration.UseDatabaseNullSemantics = true; 

When you do this, all the extra null checks will go away and your queries should perform faster if they were affected by this issue.
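For example, the flag can be set once in the constructor of the context from the question so that it applies to every query (a sketch):

public class MyDbContext : DbContext
{
    public MyDbContext()
        : base("DefaultConnection")
    {
        // Use database (three-valued) null comparison semantics, so EF6
        // stops emitting the extra IS NULL compensation terms.
        Configuration.UseDatabaseNullSemantics = true;
    }

    public DbSet<Widget> Widgets { get; set; }
}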

Long answer: Others in this thread are right that in EF6 we introduced the extra null-checking terms by default to compensate for the differences between null comparison semantics in the database (three-valued logic) and standard in-memory null comparisons. The goal of this is to satisfy the following very popular request:

Incorrect handling of null variables in 'where' clause

Paul White is also right that, in the following expression, the 'AND NOT' part is less commonly seen when compensating for three-valued logic:

((x = y) AND NOT (x IS NULL OR y IS NULL)) OR (x IS NULL AND y IS NULL) 

That extra condition is necessary in the general case to prevent the result of the whole expression from being NULL. For example, assume that x = 1 and y = NULL; then, without the 'AND NOT' term, the expression evaluates as:

(x = y) --> NULL
(x IS NULL AND y IS NULL) --> false
NULL OR false --> NULL

The distinction between NULL and false is important in case the comparison expression is negated at a later point in the composition of the query expression, e.g.:

NOT (false) --> true
NOT (NULL) --> NULL

It is also true that we could potentially add the smarts to EF to figure out when this extra term is unnecessary (e.g. if we know that the expression isn't negated in the predicate of the query) and to optimize it out of the query.

By the way, we are tracking this issue in the following EF bug at codeplex:

[Performance] Reduce the expression tree for complex queries in case of C# null comparison semantics

Answered by divega