Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

COUNT (DISTINCT column_name) Discrepancy vs. COUNT (column_name) in SQL Server 2008?

I'm running into a problem that's driving me nuts. When running the query below, I get a count of 233,769

 SELECT COUNT(distinct  Member_List_Link.UserID)  
 FROM Member_List_Link  with (nolock)   
 INNER JOIN MasterMembers with (nolock)  
     ON Member_List_Link.UserID = MasterMembers.UserID   
  WHERE MasterMembers.Active = 1 And
        Member_List_Link.GroupID = 5 AND 
        MasterMembers.ValidUsers = 1 AND 
        Member_List_Link.Status = 1

But if I run the same query without the distinct keyword, I get a count of 233,748

 SELECT COUNT(Member_List_Link.UserID)  
 FROM Member_List_Link  with (nolock)   
 INNER JOIN MasterMembers with (nolock)
   ON Member_List_Link.UserID = MasterMembers.UserID   
 WHERE MasterMembers.Active = 1 And Member_List_Link.GroupID = 5 
  AND MasterMembers.ValidUsers = 1 AND Member_List_Link.Status = 1

To test, I recreated all the tables and place them into temp tables and ran the queries again:

  SELECT COUNT(distinct  #Temp_Member_List_Link.UserID)  
  FROM #Temp_Member_List_Link  with (nolock)   
  INNER JOIN #Temp_MasterMembers with (nolock)
    ON #Temp_Member_List_Link.UserID = #Temp_MasterMembers.UserID   
  WHERE #Temp_MasterMembers.Active = 1 And 
        #Temp_Member_List_Link.GroupID = 5 AND 
        #Temp_MasterMembers.ValidUsers = 1 AND 
        #Temp_Member_List_Link.Status = 1

And without the distinct keyword

  SELECT COUNT(#Temp_Member_List_Link.UserID)  
  FROM #Temp_Member_List_Link  with (nolock)   
  INNER JOIN #Temp_MasterMembers with (nolock)
    ON #Temp_Member_List_Link.UserID = #Temp_MasterMembers.UserID   
  WHERE #Temp_MasterMembers.Active = 1 And 
        #Temp_Member_List_Link.GroupID = 5 AND 
        #Temp_MasterMembers.ValidUsers = 1 AND 
        #Temp_Member_List_Link.Status = 1

On a side note, I recreated the temp tables by simply running (select * from Member_List_Link into #temp...)

And now when I check to see the difference between COUNT(column) vs. COUNT(distinct column) with these temp tables, I don't see any!

So why is there a discrepancy with the original tables?

I'm running SQL Server 2008 (Dev Edition).

UPDATE - Including statistics profile

PhysicalOp column only for the first query (without distinct)

NULL
Compute Scalar
Stream Aggregate
Clustered Index Seek

PhysicalOp column only for the first query (with distinct)

NULL
Compute Scalar
Stream Aggregate
Parallelism
Stream Aggregate
Hash Match
Hash Match
Bitmap
Parallelism
Index Seek
Parallelism
Clustered Index Scan

Rows and Executes for the 1st query (without distinct)

1   1
0   0
1   1
1   1

Rows and Executes for the 2nd query (with distinct)

Rows    Executes
1   1
0   0
1   1
16  1
16  16
233767  16
233767  16
281901  16
281901  16
281901  16
234787  16
234787  16

Adding OPTION(MAXDOP 1) to the 2nd query (with distinct)

Rows Executes

1           1
0           0
1           1
233767          1
233767          1
281901          1
548396          1

And the resulting PhysicalOp

NULL
Compute Scalar
Stream Aggregate
Hash Match
Hash Match
Index Seek
Clustered Index Scan
like image 803
Ray Avatar asked Sep 30 '11 17:09

Ray


People also ask

What is the difference between COUNT distinct and distinct COUNT?

Count would show a result of all records while count distinct will result in showing only distinct count. For instance, a table has 5 records as a,a,b,b,c then Count is 5 while Count distinct is 3.

What is the difference between COUNT (*) and COUNT column_name?

The difference is: COUNT(*) will count the number of records. COUNT(column_name) will count the number of records where column_name is not null.

What is the difference between COUNT and COUNT star?

Difference between count(*) and count(columnName) in MySQL? The count(*) returns all rows whether column contains null value or not while count(columnName) returns the number of rows except null rows.

Which is faster COUNT or COUNT distinct?

A simple COUNT(*) just has to count number of rows - no sorting involved, so it will always be faster than COUNT(DISTINCT) .


1 Answers

FROM http://msdn.microsoft.com/en-us/library/ms187373.aspx NOLOCK Is equivalent to READUNCOMMITTED. For more information, see READUNCOMMITTED later in this topic.

READUNCOMMITED will read rows twice if they are the subject of a transation- since both the roll foward and roll back rows exist within the database when the transaction is IN process.

By default all queries are read committed which excludes uncommitted rows

When you insert into a temp table the select will give you only committed rows - I believe this covers all the symptoms you are trying to explain

like image 131
Ian P Avatar answered Oct 21 '22 22:10

Ian P