
Optimize/rewrite LINQ query with GROUP BY and COUNT

Tags:

c#

linq

I'm trying to get a count of unique Foos and Bars grouped by Name, on the following data set.

Id  |   IsActive    |   Name    |   Foo     |   Bar
1   |       1       |   A       |   11      |   null
2   |       1       |   A       |   11      |   null
3   |       1       |   A       |   null    |   123
4   |       1       |   B       |   null    |   321

I expect the result on the above data to be:

Expected:
A = 2;
B = 1;

I tried grouping by Name, Foo and Bar, and then grouping by Name again with a count to get the "row" count, but that didn't give me the correct result (or the ToDictionary threw a duplicate-key exception; I played around with this a lot, so I can't quite remember).

db.MyEntity
    .Where(x => x.IsActive)
    .GroupBy(x => new { x.Name, x.Foo, x.Bar })
    .GroupBy(x => new { x.Key.Name, Count = x.Count() })
    .ToDictionary(x => x.Key, x => x.Count);

So I came up with this LINQ query. But it's rather slow.

db.MyEntity
    .Where(x => x.IsActive)
    .GroupBy(x => x.Name)
    .ToDictionary(x => x.Key,
        x =>
            x.Where(y => y.Foo != null).Select(y => y.Foo).Distinct().Count() +
            x.Where(y => y.Bar != null).Select(y => y.Bar).Distinct().Count());

How can I optimize it?

Here's the entity for reference:

public class MyEntity
{
    public int Id { get; set; }
    public bool IsActive { get; set; }
    public string Name { get; set; }
    public int? Foo { get; set; }
    public int? Bar { get; set; }
}
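
To make the expected numbers reproducible without a database, here is a minimal in-memory sketch. It assumes LINQ to Objects rather than Entity Framework, and the `SampleData`, `Rows` and `CountDistinctPerName` names are made up for illustration. It only checks the logic of the slow query, not its performance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class MyEntity
{
    public int Id { get; set; }
    public bool IsActive { get; set; }
    public string Name { get; set; }
    public int? Foo { get; set; }
    public int? Bar { get; set; }
}

public static class SampleData
{
    // The four rows from the table in the question (hypothetical helper).
    public static List<MyEntity> Rows() => new List<MyEntity>
    {
        new MyEntity { Id = 1, IsActive = true, Name = "A", Foo = 11, Bar = null },
        new MyEntity { Id = 2, IsActive = true, Name = "A", Foo = 11, Bar = null },
        new MyEntity { Id = 3, IsActive = true, Name = "A", Foo = null, Bar = 123 },
        new MyEntity { Id = 4, IsActive = true, Name = "B", Foo = null, Bar = 321 },
    };

    // Same shape as the slow query, but evaluated in memory:
    // distinct non-null Foo values plus distinct non-null Bar values per Name.
    public static Dictionary<string, int> CountDistinctPerName(IEnumerable<MyEntity> rows) =>
        rows.Where(x => x.IsActive)
            .GroupBy(x => x.Name)
            .ToDictionary(
                g => g.Key,
                g => g.Where(y => y.Foo != null).Select(y => y.Foo).Distinct().Count()
                   + g.Where(y => y.Bar != null).Select(y => y.Bar).Distinct().Count());
}

public static class Program
{
    public static void Main()
    {
        foreach (var pair in SampleData.CountDistinctPerName(SampleData.Rows()))
            Console.WriteLine($"{pair.Key} = {pair.Value}");
    }
}
```

On the sample rows this gives 2 for A and 1 for B, matching the expected result.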

Edit

I also tried this query

db.MyEntity
    .Where(x => x.IsActive)
    .GroupBy(x => new { x.Name, x.Foo, x.Bar })
    .GroupBy(x => x.Key.Name)
    .ToDictionary(x => x.Key, x => x.Count());

But that threw a timeout exception :(
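
One thing worth checking, independent of the timeout: counting distinct (Name, Foo, Bar) combinations is not the same as "distinct Foo plus distinct Bar" in general. The two agree on the sample data only because no row has both columns set. A minimal in-memory sketch, assuming LINQ to Objects and made-up rows where both columns are populated (the class and method names are hypothetical):

```csharp
using System;
using System.Linq;

public static class TripleVersusDistinct
{
    // Hypothetical rows: both Foo and Bar are set, unlike the question's data.
    static readonly (string Name, int? Foo, int? Bar)[] Rows =
    {
        ("A", 11, 123),
        ("A", 11, 456),
    };

    // Number of distinct (Name, Foo, Bar) combinations for one name.
    public static int CountTriples(string name) =>
        Rows.Where(r => r.Name == name).Distinct().Count();

    // Distinct non-null Foo values plus distinct non-null Bar values.
    public static int CountFooPlusBar(string name)
    {
        var rows = Rows.Where(r => r.Name == name).ToList();
        return rows.Where(r => r.Foo != null).Select(r => r.Foo).Distinct().Count()
             + rows.Where(r => r.Bar != null).Select(r => r.Bar).Distinct().Count();
    }

    public static void Main()
    {
        Console.WriteLine(CountTriples("A"));    // 2 distinct triples
        Console.WriteLine(CountFooPlusBar("A")); // 1 distinct Foo + 2 distinct Bar = 3
    }
}
```

So even if the double GroupBy ran quickly, it would only be a valid replacement when Foo and Bar are never both set on the same row.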

Snæbjørn asked Oct 20 '22 10:10



2 Answers

The query is extremely inefficient because much of the work (everything involved in building the dictionary) happens on the client side, so the database cannot do the projections for you. That hurts twice: the database (especially if these columns are indexed) can do this work faster than the client, and projecting on the database means far less data is sent over the network.

So simply do your projections before you group the data.

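var activeItems = db.MyEntity.Where(x => x.IsActive);

// Project each column to (Name, Value) pairs first, so the database
// only has to return distinct pairs instead of whole rows.
var query = activeItems.Select(x => new { x.Name, Value = x.Foo }).Distinct()
    .Concat(activeItems.Select(x => new { x.Name, Value = x.Bar }).Distinct())
    .Where(pair => pair.Value != null)   // drop rows where the column was null
    .GroupBy(pair => pair.Name)
    .Select(group => new { group.Key, Count = group.Count() })
    .ToDictionary(pair => pair.Key, pair => pair.Count);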
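
The same project-then-group shape can be tried in memory. This is only a sketch: it assumes LINQ to Objects instead of Entity Framework, hard-codes the four sample rows from the question, and uses a made-up `ProjectThenGroup` class. Note that the null filter is applied to the projected `Value`, not to the pair itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ProjectThenGroup
{
    // The question's sample rows as tuples (hypothetical stand-in for db.MyEntity).
    static readonly (bool IsActive, string Name, int? Foo, int? Bar)[] Rows =
    {
        (true, "A", 11, null),
        (true, "A", 11, null),
        (true, "A", null, 123),
        (true, "B", null, 321),
    };

    public static Dictionary<string, int> Count()
    {
        var active = Rows.Where(x => x.IsActive);

        // Both projections produce the same anonymous type { string Name, int? Value },
        // so they can be concatenated and grouped together.
        return active.Select(x => new { x.Name, Value = x.Foo }).Distinct()
            .Concat(active.Select(x => new { x.Name, Value = x.Bar }).Distinct())
            .Where(pair => pair.Value != null)
            .GroupBy(pair => pair.Name)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        foreach (var pair in Count())
            Console.WriteLine($"{pair.Key} = {pair.Value}");
    }
}
```

On the sample rows this yields 2 for A and 1 for B. Against a real database the benefit comes from the projections and `Distinct()` being translated to SQL, which this in-memory version does not exercise.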
Servy answered Nov 03 '22 03:11



Your aim is to produce the following query:

select Name, count(distinct Foo) + count(distinct Bar)
from myEntity
where IsActive = 1
group by Name

This is the minimal query to get what you want. But LINQ seems to overcomplicate everything as much as possible :)

The goal is to do as much as possible at the database level. Your current query is translated to:

SELECT 
    [Project2].[C1] AS [C1], 
    [Project2].[Name] AS [Name], 
    [Project2].[C2] AS [C2], 
    [Project2].[id] AS [id], 
    [Project2].[IsActive] AS [IsActive], 
    [Project2].[Name1] AS [Name1], 
    [Project2].[Foo] AS [Foo], 
    [Project2].[Bar] AS [Bar]
    FROM ( SELECT 
        [Distinct1].[Name] AS [Name], 
        1 AS [C1], 
        [Extent2].[id] AS [id], 
        [Extent2].[IsActive] AS [IsActive], 
        [Extent2].[Name] AS [Name1], 
        [Extent2].[Foo] AS [Foo], 
        [Extent2].[Bar] AS [Bar], 
        CASE WHEN ([Extent2].[id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C2]
        FROM   (SELECT DISTINCT 
            [Extent1].[Name] AS [Name]
            FROM [dbo].[SomeTable] AS [Extent1]
            WHERE [Extent1].[IsActive] = 1 ) AS [Distinct1]
        LEFT OUTER JOIN [dbo].[SomeTable] AS [Extent2] ON ([Extent2].[IsActive] = 1) AND ([Distinct1].[Name] = [Extent2].[Name])
    )  AS [Project2]
    ORDER BY [Project2].[Name] ASC, [Project2].[C2] ASC

It selects everything from the database and performs the grouping at the application layer, which is inefficient.

The query of @Servy:

var activeItems = db.MyEntity.Where(x => x.IsActive);

var query = activeItems.Select(x => new { x.Name, Value = x.Foo }).Distinct()
    .Concat(activeItems.Select(x => new { x.Name, Value = x.Bar }).Distinct())
    .Where(pair => pair.Value != null)
    .GroupBy(pair => pair.Name)
    .Select(group => new { group.Key, Count = group.Count() })
    .ToDictionary(pair => pair.Key, pair => pair.Count);

is translated to:

SELECT 
1 AS [C1], 
[GroupBy1].[K1] AS [C2], 
[GroupBy1].[A1] AS [C3]
FROM ( SELECT 
    [UnionAll1].[Name] AS [K1], 
    COUNT(1) AS [A1]
    FROM  (SELECT 
        [Distinct1].[Name] AS [Name]
        FROM ( SELECT DISTINCT 
            [Extent1].[Name] AS [Name], 
            [Extent1].[Foo] AS [Foo]
            FROM [dbo].[SomeTable] AS [Extent1]
            WHERE ([Extent1].[IsActive] = 1) AND ([Extent1].[Foo] IS NOT NULL)
        )  AS [Distinct1]
    UNION ALL
        SELECT 
        [Distinct2].[Name] AS [Name]
        FROM ( SELECT DISTINCT 
            [Extent2].[Name] AS [Name], 
            [Extent2].[Bar] AS [Bar]
            FROM [dbo].[SomeTable] AS [Extent2]
            WHERE ([Extent2].[IsActive] = 1) AND ([Extent2].[Bar] IS NOT NULL)
        )  AS [Distinct2]) AS [UnionAll1]
    GROUP BY [UnionAll1].[Name]
)  AS [GroupBy1]

It is much better.

I have tried the following:

var activeItems = (from o in db.SomeTables
                   where o.IsActive
                   group o by o.Name into gr
                   select new { gr.Key, cc = gr.Select(c => c.Foo).Distinct().Count(c => c != null) + 
                                             gr.Select(c => c.Bar).Distinct().Count(c => c != null) }).ToDictionary(c => c.Key);

This is translated to:

SELECT 
1 AS [C1], 
[Project5].[Name] AS [Name], 
[Project5].[C1] + [Project5].[C2] AS [C2]
FROM ( SELECT 
    [Project3].[Name] AS [Name], 
    [Project3].[C1] AS [C1], 
    (SELECT 
        COUNT(1) AS [A1]
        FROM ( SELECT DISTINCT 
            [Extent3].[Bar] AS [Bar]
            FROM [dbo].[SomeTable] AS [Extent3]
            WHERE ([Extent3].[IsActive] = 1) AND ([Project3].[Name] = [Extent3].[Name]) AND ([Extent3].[Bar] IS NOT NULL)
        )  AS [Distinct3]) AS [C2]
    FROM ( SELECT 
        [Distinct1].[Name] AS [Name], 
        (SELECT 
            COUNT(1) AS [A1]
            FROM ( SELECT DISTINCT 
                [Extent2].[Foo] AS [Foo]
                FROM [dbo].[SomeTable] AS [Extent2]
                WHERE ([Extent2].[IsActive] = 1) AND ([Distinct1].[Name] = [Extent2].[Name]) AND ([Extent2].[Foo] IS NOT NULL)
            )  AS [Distinct2]) AS [C1]
        FROM ( SELECT DISTINCT 
            [Extent1].[Name] AS [Name]
            FROM [dbo].[SomeTable] AS [Extent1]
            WHERE [Extent1].[IsActive] = 1
        )  AS [Distinct1]
    )  AS [Project3]
)  AS [Project5]

Much the same as the second version, but with correlated subqueries instead of a union.
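
To sanity-check that this third formulation produces the same counts, here is a minimal in-memory sketch. It assumes LINQ to Objects and the four sample rows from the question; the `GroupedDistinctCounts` class is made up, and the extra value selector `c => c.cc` is my own small change so that the dictionary maps each name directly to its count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupedDistinctCounts
{
    // The question's sample rows as tuples (hypothetical stand-in for db.SomeTables).
    static readonly (bool IsActive, string Name, int? Foo, int? Bar)[] Rows =
    {
        (true, "A", 11, null),
        (true, "A", 11, null),
        (true, "A", null, 123),
        (true, "B", null, 321),
    };

    public static Dictionary<string, int> Count() =>
        (from o in Rows
         where o.IsActive
         group o by o.Name into gr
         select new
         {
             gr.Key,
             // Distinct() keeps at most one null; the predicate then skips it.
             cc = gr.Select(c => c.Foo).Distinct().Count(c => c != null)
                + gr.Select(c => c.Bar).Distinct().Count(c => c != null)
         }).ToDictionary(c => c.Key, c => c.cc);

    public static void Main()
    {
        foreach (var pair in Count())
            Console.WriteLine($"{pair.Key} = {pair.Value}");
    }
}
```

This again gives 2 for A and 1 for B on the sample rows; it says nothing about how the query performs once translated to SQL.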

Conclusion:

If the table is quite large and performance is crucial, I would create a view and import it into the model. Otherwise, stick with the third version or @Servy's second version. Performance should be tested, of course.

Giorgi Nakeuri answered Nov 03 '22 02:11
