SQLServer count() over() with distinct

Question

I'm working on a project, where we need to count the number of distinct rows. A simplified version of the scenario includes a user table, a keyword table and a keyword_user table.

The user table just includes common user meta data, like name etc. The other tables are listed below.

keyword_user:

id
user_id
keyword_id

keyword:

id,
description

What I want to do, is to find a max number of users (5), based on the users keyword_id's, while also counting the total number of matching rows. The count must be distinct.

The query:

SELECT TOP 5 u.[id], 
             u.[firstname], 
             u.[lastname], 
             total = Count(*) OVER() 
FROM   [user] u 
       INNER JOIN [keyword_user] ku 
               ON u.[id] = ku.[user_id] 
WHERE  ( ku.keyword_id IN ( '5f6501ec-0a71-4067-a21d-3c5f87a76411', 'c19b95c0-8554-4bbd-9526-db8f1c4f1edf')) 
       AND u.id NOT IN ( '12db3001-b3b9-4626-8a02-2519102cb53a' )

Resultset:

+--------------------------------------+-----------+----------+-------+
|                  id                  | firstname | lastname | total |
+--------------------------------------+-----------+----------+-------+
| F0527AC3-747A-45A6-9CF9-B1F6C7F548F8 | Kasper    | Thomsen  |     3 |
| 95988F6D-9C91-4779-B6C3-3D4B4D6AE836 | Michael   | Jacobsen |     3 |
| 95988F6D-9C91-4779-B6C3-3D4B4D6AE836 | Michael   | Jacobsen |     3 |
+--------------------------------------+-----------+----------+-------+

PROBLEM:

The problem being here, that Michael is counted twice, and the total count therefore is 3, when i want it to be 2. When using count() over() you cannot parse an expression into it, that contains distinct. Also, if I just SELECT DISTINCT, my resultset looks fine, besides the total count, which would still be 3.

If I need to include more information to support the question, please let me know, and I will try to answer the best I can.

MSSQL CREATE DB SCRIPT (SAMPLE DATA)

example_data.sql

wanted resultset:

+--------------------------------------+-----------+----------+-------+
|                  id                  | firstname | lastname | total |
+--------------------------------------+-----------+----------+-------+
| F0527AC3-747A-45A6-9CF9-B1F6C7F548F8 | Kasper    | Thomsen  |     2 |
| 95988F6D-9C91-4779-B6C3-3D4B4D6AE836 | Michael   | Jacobsen |     2 |
+--------------------------------------+-----------+----------+-------+

Vladimir Baranov · Accepted Answer

You really should explain what you need in the question, not in comments.

In CTE_Users we find all distinct users for the given keywords. Then join the result with user to get user details. At least it produces the result that you expect with the given small sample data.

WITH
CTE_Users
AS
(
    SELECT DISTINCT ku.user_id
    FROM
        keyword_user AS ku
    WHERE
        ku.keyword_id IN (
            '5f6501ec-0a71-4067-a21d-3c5f87a76411', 
            'c19b95c0-8554-4bbd-9526-db8f1c4f1edf')
        AND ku.user_id NOT IN (
            '12db3001-b3b9-4626-8a02-2519102cb53a')
)
SELECT TOP(5)
    u.id
    ,u.firstname
    ,u.lastname
    ,COUNT(*) OVER() AS total
FROM
    user AS u
    INNER JOIN CTE_Users ON CTE_Users.user_id = u.id
;

ken lacoste · Answer

i'm a little confused in your case specially with the "keywords" and how they're related per user (this is just a process issue for me) thus found myself doing an outer group by containing your initial query as my source table.

Please comment below so we can improve this.

SELECT 
    id
    , firstname
    , lastname
    , total
    , COUNT(*) AS [per_user_count]
FROM (
       SELECT TOP 5 u.[id], 
             u.[firstname], 
             u.[lastname], 
             total = Count(*) OVER() 
       FROM [user] u 
       INNER JOIN [keyword_user] ku 
           ON u.[id] = ku.[user_id] 
       WHERE 
       ( 
           ku.keyword_id IN (
              '5f6501ec-0a71-4067-a21d-3c5f87a76411'
              , 'c19b95c0-8554-4bbd-9526-db8f1c4f1edf'
           )
       ) 
       AND u.id NOT IN ('12db3001-b3b9-4626-8a02-2519102cb53a')
) AS T
GROUP BY
    T.id
    , T.firstname
    , T.lastname
    , T.total

EDIT : We really had a confusion there, then so I created a simpler script that shall exclude the keywords, just the unique users (to generate the total overall) and get the top 5 of it (random order).

    SELECT 
        TOP 5
        T.id
        , T.firstname
        , T.lastname
        , Total = COUNT(*) OVER()
    FROM (
        SELECT DISTINCT
            u.*
        FROM [keyword_user] ku
        LEFT JOIN [user] u
        ON
            ku.user_id = u.id
        WHERE 
        ( 
            ku.keyword_id IN (
                '5f6501ec-0a71-4067-a21d-3c5f87a76411'
                , 'c19b95c0-8554-4bbd-9526-db8f1c4f1edf')
        ) 
        AND ku.[user_id] NOT IN ( 
                '12db3001-b3b9-4626-8a02-2519102cb53a' 
            )
    ) AS T

Thanks

EDIT : Your scenario is a straight forward "search of keywords" that is linked to an entity, with an overall count and a top 5 result. As how I understand the CTE (and based in MSDN), CTE is a great solution for hierarchical data mining (no need to do while and do whatever back-flip to get your organizational hierarchy) which doesn't really fit the scenario that we have here.

SQLServer count() over() with distinct

Tags:

join

inner-join

sql-server

count

window-functions

Teilmann

2 Answers

Vladimir Baranov

ken lacoste

Recent Activity

Donate For Us

SQLServer count() over() with distinct

Tags:

join

inner-join

sql-server

count

window-functions

Teilmann

2 Answers

Vladimir Baranov

ken lacoste

Related questions

Recent Activity

Donate For Us