Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to solve the greatest-n-per-group problem with Entity Framework (Core)?

Question

Given, for example, the following data set:

new Entity { Id = 1, Group = 1, Value = "ABC", ... },
new Entity { Id = 2, Group = 1, Value = "DEF", ... },
new Entity { Id = 3, Group = 1, Value = "FGH", ... },
new Entity { Id = 4, Group = 1, Value = "LOP", ... },
new Entity { Id = 5, Group = 2, Value = "ALO", ... },
new Entity { Id = 6, Group = 2, Value = "PEO", ... },
new Entity { Id = 7, Group = 2, Value = "AHB", ... },
new Entity { Id = 8, Group = 2, Value = "DHB", ... },
new Entity { Id = 9, Group = 2, Value = "QPA", ... },
new Entity { Id = 10, Group = 2, Value = "LAN", ... },
// ... millions more records

how can I make a query, which is efficient (avoids the N+1 query problem) and gives me the top 3 records for each Group ordered by Value?

new Entity { Id = 1, Group = 1, Value = "ABC", ... },
new Entity { Id = 2, Group = 1, Value = "DEF", ... },
new Entity { Id = 3, Group = 1, Value = "FGH", ... },
new Entity { Id = 5, Group = 2, Value = "ALO", ... },
new Entity { Id = 7, Group = 2, Value = "AHB", ... },
new Entity { Id = 8, Group = 2, Value = "DHB", ... },
// ...

What have I tried?

Most of the LINQ or Entity Framework solutions on Stack Overflow use GroupBy with Take which is evaluated client side (which means that all records are imported in memory and then the grouping happens outside of the database).

I've tried with:

var list = await _dbContext.Entities
    .Select(x => new 
    { 
        OrderKey = _dbContext.Entities.Count(y =>
            x.Group == y.Group
                && y.Value < x.Value),
        Value = x,
     })
     .Where(x => x.OrderKey < 3)
     .OrderBy(x => x.OrderKey)
     .Select(x => x.Value)
     .ToListAsync(cancellationToken);

but I'm pretty sure that's as inefficient as it gets.

Bonus question

How can I extract this logic into an extension method for IQueryable<T> which returns IQueryable<T>?

like image 713
Shoe Avatar asked Mar 30 '19 09:03

Shoe


1 Answers

Interesting question. The main problem I see is that there is no standard SQL construct for performing such operation - most of the databases provide their own operators for working with rowset "window", like SqlServer's SELECT - OVER etc. There is also no "standard" LINQ operator / pattern for that.

Given

IQueryable<Entity> source

the typical way of performing such operation in LINQ is

var query = source.GroupBy(e => e.Group)
    .SelectMany(g => g.OrderBy(e => e.Value).Take(3));

which EF6 translates to the following SQL

SELECT
    [Limit1].[Id] AS [Id],
    [Limit1].[Group] AS [Group],
    [Limit1].[Value] AS [Value]
    FROM   (SELECT DISTINCT
        [Extent1].[Group] AS [Group]
        FROM [dbo].[Entity] AS [Extent1] ) AS [Distinct1]
    CROSS APPLY  (SELECT TOP (3) [Project2].[Id] AS [Id], [Project2].[Group] AS [Group], [Project2].[Value] AS [Value]
        FROM ( SELECT
            [Extent2].[Id] AS [Id],
            [Extent2].[Group] AS [Group],
            [Extent2].[Value] AS [Value]
            FROM [dbo].[Entity] AS [Extent2]
            WHERE [Distinct1].[Group] = [Extent2].[Group]
        )  AS [Project2]
        ORDER BY [Project2].[Value] ASC ) AS [Limit1]

I can't say if it's good or bad translation, but at least it is some translation. The important is that EF Core currently (latest 2.2.3 at the time of writing) cannot translate it to SQL and will use client evaluation (as you mentioned).

So currently there seem to be only 3 translatable LINQ ways of writing such query:

(1) (yours)

var query = source.Where(e => source.Count(
    e2 => e2.Group == e.Group && e2.Value.CompareTo(e.Value) < 0) < 3);

translates to

  SELECT [e].[Id], [e].[Group], [e].[Value]
  FROM [Entity] AS [e]
  WHERE (
      SELECT COUNT(*)
      FROM [Entity] AS [e2]
      WHERE ([e2].[Group] = [e].[Group]) AND [e2].[Value] < [e].[Value]
  ) < 3

(2)

var query = source.Where(e => source.Where(e2 => e2.Group == e.Group)
    .OrderBy(e2 => e2.Value).Take(3).Contains(e));

translates to

  SELECT [e].[Id], [e].[Group], [e].[Value]
  FROM [Entity] AS [e]
  WHERE [e].[Id] IN (
      SELECT TOP(3) [e2].[Id]
      FROM [Entity] AS [e2]
      WHERE [e2].[Group] = [e].[Group]
      ORDER BY [e2].[Value]
  )

(3)

var query = source.SelectMany(e => source.Where(e2 => e2.Group == e.Group)
    .OrderBy(e2 => e2.Value).Take(3).Where(e2 => e2.Id == e.Id));

translates to

  SELECT [t].[Id], [t].[Group], [t].[Value]
  FROM [Entity] AS [e]
  CROSS APPLY (
      SELECT TOP(3) [e2].[Id], [e2].[Group], [e2].[Value]
      FROM [Entity] AS [e2]
      WHERE [e2].[Group] = [e].[Group]
      ORDER BY [e2].[Value]
  ) AS [t]
  WHERE [t].[Id] = [e].[Id]

I can't say which one to choose - you have to measure the execution plans.

The main drawback of #1 the comparison operator (as it can be seen in the example - can't use < for strings, for Guids it's even worse), and also won't work correctly if the Value is not unique inside the grouping.

From the other side it might be the fastest of the three. But it's possible the execution plan for #2 and #3 (and even #1) to be the same.

With that being said, I'm not going to provide a generalized method, since all these approaches require different parameters, with only common eventually being the group selector Expression<Func<T, TGroupKey>> (e.g. e => e.Group). But (especially for #2 and #3) it's possible to write such a method - it would need some manual Expression manipulations, and overall I'm not sure it does worth the efforts

like image 138
Ivan Stoev Avatar answered Nov 10 '22 06:11

Ivan Stoev