Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LINQ generating SQL with duplicate nested selects

I'm very new to the .NET Entity Framework, and I think it's awesome, but somehow I'm getting this strange issue (sorry for the spanish but my program is in that language, anyway it's not a big deal, just the column or property names): I'm doing a normal LINQ To Entities query to get a list of UltimaConsulta, like this:

var query = from uc in bd.UltimasConsultas
            select uc;

UltimasConsultas is a view, btw. The thing is that LINQ is generating this SQL for the query:

SELECT 
[Extent1].[IdPaciente] AS [IdPaciente], 
[Extent1].[Nombre] AS [Nombre], 
[Extent1].[PrimerApellido] AS [PrimerApellido], 
[Extent1].[SegundoApellido] AS [SegundoApellido], 
[Extent1].[Fecha] AS [Fecha]
FROM (SELECT 
      [UltimasConsultas].[IdPaciente] AS [IdPaciente], 
      [UltimasConsultas].[Nombre] AS [Nombre], 
      [UltimasConsultas].[PrimerApellido] AS [PrimerApellido], 
      [UltimasConsultas].[SegundoApellido] AS [SegundoApellido], 
      [UltimasConsultas].[Fecha] AS [Fecha]
      FROM [dbo].[UltimasConsultas] AS [UltimasConsultas]) AS [Extent1]

Why is LINQ generating a nested Select? I thought from videos and examples that it generates normal SQL selects for this kind of queries. Do I have to configure something (the entity model was generating from a wizard, so it's default configuration)? Thanks in advance for your answers.

like image 530
Alfonso Vasquez Avatar asked Jan 21 '10 20:01

Alfonso Vasquez


1 Answers

To be clear, LINQ to Entities does not generate the SQL. Instead, it generates an ADO.NET canonical command tree, and the ADO.NET provider for your database, presumably SQL Server in this case, generates the SQL.

So why does it generate this derived table (I think "derived table" is the more correct term for the SQL feature in use here)? Because the code which generates the SQL has to generate SQL for a wide variety of LINQ queries, most of which are not nearly as trivial as the one you show. These queries will often be selecting data for multiple types (many of which might be anonymous, rather than named types), and in order to keep the SQL generation relatively sane, they are grouped into extents for each type.

Another question: Why should you care? It's easy to demonstrate that the use of the derived table in this statement is "free" from a performance point of view.

I selected a table at random from a populated database, and run the following query:

SELECT [AddressId]
      ,[Address1]
      ,[Address2]
      ,[City]
      ,[State]
      ,[ZIP]
      ,[ZIPExtension]
  FROM [VertexRM].[dbo].[Address]

Let's look at the cost:

<StmtSimple StatementCompId="1" StatementEstRows="7900" StatementId="1" StatementOptmLevel="TRIVIAL" StatementSubTreeCost="0.123824" StatementText="/****** Script for SelectTopNRows command from SSMS  ******/&#xD;&#xA;SELECT [AddressId]&#xD;&#xA;      ,[Address1]&#xD;&#xA;      ,[Address2]&#xD;&#xA;      ,[City]&#xD;&#xA;      ,[State]&#xD;&#xA;      ,[ZIP]&#xD;&#xA;      ,[ZIPExtension]&#xD;&#xA;  FROM [VertexRM].[dbo].[Address]" StatementType="SELECT">
  <StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
  <QueryPlan CachedPlanSize="9" CompileTime="0" CompileCPU="0" CompileMemory="64">
    <RelOp AvgRowSize="246" EstimateCPU="0.008847" EstimateIO="0.114977" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="7900" LogicalOp="Clustered Index Scan" NodeId="0" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="0.123824">

Now let's compare that to the query with the derived table:

SELECT 
       [Extent1].[AddressId]
      ,[Extent1].[Address1]
      ,[Extent1].[Address2]
      ,[Extent1].[City]
      ,[Extent1].[State]
      ,[Extent1].[ZIP]
      ,[Extent1].[ZIPExtension]
  FROM (SELECT [AddressId]
          ,[Address1]
          ,[Address2]
          ,[City]
          ,[State]
          ,[ZIP]
          ,[ZIPExtension]
  FROM[VertexRM].[dbo].[Address]) AS [Extent1]

And the cost:

<StmtSimple StatementCompId="1" StatementEstRows="7900" StatementId="1" StatementOptmLevel="TRIVIAL" StatementSubTreeCost="0.123824" StatementText="/****** Script for SelectTopNRows command from SSMS  ******/&#xD;&#xA;SELECT &#xD;&#xA;       [Extent1].[AddressId]&#xD;&#xA;      ,[Extent1].[Address1]&#xD;&#xA;      ,[Extent1].[Address2]&#xD;&#xA;      ,[Extent1].[City]&#xD;&#xA;      ,[Extent1].[State]&#xD;&#xA;      ,[Extent1].[ZIP]&#xD;&#xA;      ,[Extent1].[ZIPExtension]&#xD;&#xA;  FROM (SELECT [AddressId]&#xD;&#xA;          ,[Address1]&#xD;&#xA;          ,[Address2]&#xD;&#xA;          ,[City]&#xD;&#xA;          ,[State]&#xD;&#xA;          ,[ZIP]&#xD;&#xA;          ,[ZIPExtension]&#xD;&#xA;  FROM[VertexRM].[dbo].[Address]) AS [Extent1]" StatementType="SELECT">
  <StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
  <QueryPlan CachedPlanSize="9" CompileTime="0" CompileCPU="0" CompileMemory="64">
    <RelOp AvgRowSize="246" EstimateCPU="0.008847" EstimateIO="0.114977" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="7900" LogicalOp="Clustered Index Scan" NodeId="0" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="0.123824">

In both cases, SQL Server simply scans the clustered index. Not surprisingly, the cost is almost precisely the same.

Let's take a look at a slightly more complicated query. I fired up LINQPad, and entered the following query against the same table, plus one related table:

from a in Addresses
select new
{
    Id = a.Id,
    Address1 = a.Address1,
    Address2 = a.Address2,
    City = a.City,
    State = a.State,
    ZIP = a.ZIP,
    ZIPExtension = a.ZIPExtension,
    PersonCount = a.EntityAddresses.Count()
}

This generates the following SQL:

SELECT 
1 AS [C1], 
[Project1].[AddressId] AS [AddressId], 
[Project1].[Address1] AS [Address1], 
[Project1].[Address2] AS [Address2], 
[Project1].[City] AS [City], 
[Project1].[State] AS [State], 
[Project1].[ZIP] AS [ZIP], 
[Project1].[ZIPExtension] AS [ZIPExtension], 
[Project1].[C1] AS [C2]
FROM ( SELECT 
    [Extent1].[AddressId] AS [AddressId], 
    [Extent1].[Address1] AS [Address1], 
    [Extent1].[Address2] AS [Address2], 
    [Extent1].[City] AS [City], 
    [Extent1].[State] AS [State], 
    [Extent1].[ZIP] AS [ZIP], 
    [Extent1].[ZIPExtension] AS [ZIPExtension], 
    (SELECT 
        COUNT(cast(1 as bit)) AS [A1]
        FROM [dbo].[EntityAddress] AS [Extent2]
        WHERE [Extent1].[AddressId] = [Extent2].[AddressId]) AS [C1]
    FROM [dbo].[Address] AS [Extent1]
)  AS [Project1]

Analyzing this, we can see that Project1 is the projection onto the anonymous type. Extent1 is the Address table/entity. And Extent2 is the table for the association. Now there is no derived table for Address, but there is one for the projection.

I don't know if you have ever written a SQL generation system, but it isn't easy. I believe that the general problem of proving that a LINQ to Entities query and a SQL query are equivalent is NP-hard, although certain specific cases are obviously much easier. SQL is intentionally Turing-incomplete, because its designers wanted all SQL queries to execute in bounded time. LINQ, not so.

In short, this is a very difficult problem to solve, and the combination of the Entity Framework and its providers do occasionally sacrifice some readability in favor of consistency over a wide range of queries. But it shouldn't be a performance issue.

like image 159
Craig Stuntz Avatar answered Oct 17 '22 09:10

Craig Stuntz