Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQLServer count() over() with distinct

I'm working on a project, where we need to count the number of distinct rows. A simplified version of the scenario includes a user table, a keyword table and a keyword_user table.

The user table just includes common user meta data, like name etc. The other tables are listed below.

keyword_user:

id
user_id
keyword_id


keyword:

id,
description

What I want to do, is to find a max number of users (5), based on the users keyword_id's, while also counting the total number of matching rows. The count must be distinct.

The query:

SELECT TOP 5 u.[id], 
             u.[firstname], 
             u.[lastname], 
             total = Count(*) OVER() 
FROM   [user] u 
       INNER JOIN [keyword_user] ku 
               ON u.[id] = ku.[user_id] 
WHERE  ( ku.keyword_id IN ( '5f6501ec-0a71-4067-a21d-3c5f87a76411', 'c19b95c0-8554-4bbd-9526-db8f1c4f1edf')) 
       AND u.id NOT IN ( '12db3001-b3b9-4626-8a02-2519102cb53a' ) 

Resultset:

+--------------------------------------+-----------+----------+-------+
|                  id                  | firstname | lastname | total |
+--------------------------------------+-----------+----------+-------+
| F0527AC3-747A-45A6-9CF9-B1F6C7F548F8 | Kasper    | Thomsen  |     3 |
| 95988F6D-9C91-4779-B6C3-3D4B4D6AE836 | Michael   | Jacobsen |     3 |
| 95988F6D-9C91-4779-B6C3-3D4B4D6AE836 | Michael   | Jacobsen |     3 |
+--------------------------------------+-----------+----------+-------+


PROBLEM:

The problem being here, that Michael is counted twice, and the total count therefore is 3, when i want it to be 2. When using count() over() you cannot parse an expression into it, that contains distinct. Also, if I just SELECT DISTINCT, my resultset looks fine, besides the total count, which would still be 3.

If I need to include more information to support the question, please let me know, and I will try to answer the best I can.

MSSQL CREATE DB SCRIPT (SAMPLE DATA)

example_data.sql

wanted resultset:

+--------------------------------------+-----------+----------+-------+
|                  id                  | firstname | lastname | total |
+--------------------------------------+-----------+----------+-------+
| F0527AC3-747A-45A6-9CF9-B1F6C7F548F8 | Kasper    | Thomsen  |     2 |
| 95988F6D-9C91-4779-B6C3-3D4B4D6AE836 | Michael   | Jacobsen |     2 |
+--------------------------------------+-----------+----------+-------+
like image 678
Teilmann Avatar asked Oct 16 '15 08:10

Teilmann


2 Answers

You really should explain what you need in the question, not in comments.

In CTE_Users we find all distinct users for the given keywords. Then join the result with user to get user details. At least it produces the result that you expect with the given small sample data.

WITH
CTE_Users
AS
(
    SELECT DISTINCT ku.user_id
    FROM
        keyword_user AS ku
    WHERE
        ku.keyword_id IN (
            '5f6501ec-0a71-4067-a21d-3c5f87a76411', 
            'c19b95c0-8554-4bbd-9526-db8f1c4f1edf')
        AND ku.user_id NOT IN (
            '12db3001-b3b9-4626-8a02-2519102cb53a')
)
SELECT TOP(5)
    u.id
    ,u.firstname
    ,u.lastname
    ,COUNT(*) OVER() AS total
FROM
    user AS u
    INNER JOIN CTE_Users ON CTE_Users.user_id = u.id
;
like image 155
Vladimir Baranov Avatar answered Oct 29 '22 03:10

Vladimir Baranov


i'm a little confused in your case specially with the "keywords" and how they're related per user (this is just a process issue for me) thus found myself doing an outer group by containing your initial query as my source table.

Please comment below so we can improve this.

SELECT 
    id
    , firstname
    , lastname
    , total
    , COUNT(*) AS [per_user_count]
FROM (
       SELECT TOP 5 u.[id], 
             u.[firstname], 
             u.[lastname], 
             total = Count(*) OVER() 
       FROM [user] u 
       INNER JOIN [keyword_user] ku 
           ON u.[id] = ku.[user_id] 
       WHERE 
       ( 
           ku.keyword_id IN (
              '5f6501ec-0a71-4067-a21d-3c5f87a76411'
              , 'c19b95c0-8554-4bbd-9526-db8f1c4f1edf'
           )
       ) 
       AND u.id NOT IN ('12db3001-b3b9-4626-8a02-2519102cb53a')
) AS T
GROUP BY
    T.id
    , T.firstname
    , T.lastname
    , T.total

EDIT : We really had a confusion there, then so I created a simpler script that shall exclude the keywords, just the unique users (to generate the total overall) and get the top 5 of it (random order).

    SELECT 
        TOP 5
        T.id
        , T.firstname
        , T.lastname
        , Total = COUNT(*) OVER()
    FROM (
        SELECT DISTINCT
            u.*
        FROM [keyword_user] ku
        LEFT JOIN [user] u
        ON
            ku.user_id = u.id
        WHERE 
        ( 
            ku.keyword_id IN (
                '5f6501ec-0a71-4067-a21d-3c5f87a76411'
                , 'c19b95c0-8554-4bbd-9526-db8f1c4f1edf')
        ) 
        AND ku.[user_id] NOT IN ( 
                '12db3001-b3b9-4626-8a02-2519102cb53a' 
            )
    ) AS T

Thanks

EDIT : Your scenario is a straight forward "search of keywords" that is linked to an entity, with an overall count and a top 5 result. As how I understand the CTE (and based in MSDN), CTE is a great solution for hierarchical data mining (no need to do while and do whatever back-flip to get your organizational hierarchy) which doesn't really fit the scenario that we have here.

like image 27
ken lacoste Avatar answered Oct 29 '22 02:10

ken lacoste