Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance using DISTINCT COUNT

I am running SQL Server 2012.

I have a query that when striped to its most basic form looks like this:

SELECT COUNT(DISTINCT fullAddress) as quickCount 
FROM leads
WHERE yearID >=12 AND yearID <=21

The leads table has about 149 Million records in it. There is a clustered index on the leadID and a non-clustered index that is indexes on YearID and has an include for fullAddress.

This query as it is takes about 40 secs to run. I realize that is not bad but in this situation that is just not fast enough.

I looked at the execution plan and from what I can tell about 60% of the cost is the DISTINCT COUNT.

When I run the same query without the DISTINCT COUNT like this:

SELECT COUNT(*) as quickCount 
FROM leads
WHERE yearID >=12 AND yearID <=21

It takes only 1 sec to run.

Unfortunately, I need to get a count of distinct full addresses. So I am trying to figure out if there is anything I can do to make the first query run faster.

Here is a screenshot of the execution plan for both queries:

enter image description here

Here is a link to that to see it bigger - http://www.sequenzia.com/execPlan.jpg

From what I can tell my main problem is the Distinct Sort (52%).

Any help or feedback on this would be great.

Thanks!

UPDATE

I took Thilo's advice and applied this index:

 CREATE INDEX IDX_X ON LEADS(FULLADDRESS, YEARID);

I actually created 2 new test tables with the exact same 1 Million records in each of them. I applied my same original index to both and then the above index to just one. Now when I compare the the 2 tables on the same execution plan the one with the above index is a little better 48% to 52%. Here is the new execution plan - http://www.sequenzia.com/execPlan2.jpg

That helps some but I really need more performance. Any other ideas out there?

like image 690
Sequenzia Avatar asked Dec 18 '12 00:12

Sequenzia


2 Answers

This answer does not help to solve original question(SQL Server 2012). It will work on Azure SQL DB and SQL Server 2019+.

##Approximate Count Distinct enters Public Preview in Azure SQL Database

Approximate Query Processing is a new family of features which are designed to provide aggregations across very large data sets where responsiveness is more critical than absolute precision. ...

The new APPROX_COUNT_DISTINCT aggregate function returns the approximate number of unique non-null values in a group.

##Advantages

APPROX_COUNT_DISTINCT uses far less memory to calculate the distinct count than an exhaustive precise COUNT DISTINCT would. For this reason it is far more likely that the calculation can be done in memory without spilling to disk, even when there are billions of rows in the dataset. In cases where COUNT DISTINCT would run out of memory and spill data to TempDB which causes very large performance penalties, APPROX_COUNT_DISTINCT would typically not spill.

Partly as a consequence of not spilling large datasets to TempDB, and partly as a consequence of the algorithms internally, for very large datasets APPROX_COUNT_DISTINCT will execute far faster than COUNT DISTINCT.

More at APPROX_COUNT_DISTINCT doc

SELECT APPROX_COUNT_DISTINCT(fullAddress) as quickCount 
FROM leads
WHERE yearID >=12 AND yearID <=21
like image 143
Lukasz Szozda Avatar answered Sep 18 '22 07:09

Lukasz Szozda


One thing to try is to get rid of the sorting, by having an index ordered on fullAddress (that also includes the yearID column so that you can satisfy the where clause, too).

CREATE INDEX IDX_X ON LEADS(FULLADDRESS, YEARID);

This way, you should get a Fast Full Index Scan (probably still slower than the Index Range Scan you have for non-distinct count, but hopefully faster than your 40s sorting).

But why does it need to be so fast? This is not something you need to do all the time, right? If this is for a public web site, you can get away with a slightly outdated cached result, I would think.

like image 27
Thilo Avatar answered Sep 20 '22 07:09

Thilo