I'm getting poor performance from DISTINCT. The explain plan indicates that it is doing SORT (GROUP BY) which doesn't sound right. I would expect some kind of HASH aggregation to produce much better result. Is there a hint to tell oracle to use HASH for DISTINCT rather than sort? I've used /*+ USE_HASH_AGGREGATION */ in similar situations, but it is not working for DISTINCT.
So this is my original query:
SELECT
count(distinct userid) n, col
FROM users
GROUP BY col;
users has 30M rows, each userid is there 12 times. This query takes 70 seconds.
Now we rewrite it as
SELECT
count(userid) n, col
FROM
(SELECT distinct userid, col FROM users)
GROUP BY col
And it takes 40 seconds. Now add the hint to do hash instead of sort:
SELECT
count(userid) n, col
FROM
(SELECT /*+ USE_HASH_AGGREGATION */ distinct userid, col FROM users)
GROUP BY col
and it takes 10 seconds.
If somebody can explain to me why this is happening or how I can beat the first simple query into working as good as the 3rd one, that would be fantastic.
The reason I care about query simplicity is because these queries are actually generated.
Plans: 1) Slow:
----------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem | Used-Tmp|
--------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 5 |00:01:12.01 | 283K| 292K| | | | |
| 1 | SORT GROUP BY | | 1 | 5 | 5 |00:01:12.01 | 283K| 292K| 194M| 448K| 172M (0)| 73728 |
| 2 | TABLE ACCESS FULL| USERS | 1 | 29M| 29M|00:00:08.17 | 283K| 283K| | | | |
2) Fast
--------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 5 |00:00:13.09 | 283K| 283K| | | |
| 1 | SORT GROUP BY | | 1 | 5 | 5 |00:00:13.09 | 283K| 283K| 3072 | 3072 | 2048 (0)|
| 2 | VIEW | | 1 | 8647K| 2445K|00:00:13.16 | 283K| 283K| | | |
| 3 | HASH UNIQUE | | 1 | 8647K| 2445K|00:00:12.57 | 283K| 283K| 113M| 10M| 216M (0)|
| 4 | TABLE ACCESS FULL| USERS | 1 | 29M| 29M|00:00:07.68 | 283K| 283K| | | |
--------------------------------------------------------------------------------------------------------------------------------------------
No. There are a number of circumstances in which a DISTINCT in Oracle does not imply a sort, the most important of which is the hashing algorithm used in 10g+ for both group by and distinct operations. Always specify ORDER BY if you want an ordered result set, even in 9i and below.
Without a transformation, a statement that contains both DISTINCT and ORDER BY would require two separate sorting steps-one to satisfy DISTINCT and one to satisfy ORDER BY. (Currently, Derby uses sorting to evaluate DISTINCT.
I personally wouldn't consider performance, but what is semantically correct. DISTINCT implies you want a distinct set of columns. However, GROUP BY implies you want to compute some sort of aggregate value which you are not.
The DISTINCT clause is used in a SELECT statement to filter duplicate rows in the result set. It ensures that rows returned are unique for the column or columns specified in the SELECT clause.
How about trying the following: If you had an index on col and userid it should resolve completely in the index and not need to touch the table at all.
Select count(userid) n, col
from (select col, userid from users group by col, userid)
group by col
;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With