I found this post about speeding up distinct queries:
Super-fast DISTINCT using a recursive CTE:
USE tempdb;
GO
DROP TABLE dbo.Test;
GO
CREATE TABLE
dbo.Test
(
data INTEGER NOT NULL,
);
GO
CREATE CLUSTERED INDEX c ON dbo.Test (data);
GO
-- Lots of duplicated values
INSERT dbo.Test WITH (TABLOCK)
(data)
SELECT TOP (5000000)
ROW_NUMBER() OVER (ORDER BY (SELECT 0)) / 117329
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3;
GO
SET STATISTICS TIME ON;
-- 1591ms CPU
SELECT DISTINCT
data
FROM dbo.Test;
-- 15ms CPU
WITH RecursiveCTE
AS (
SELECT data = MIN(T.data)
FROM dbo.Test T
UNION ALL
SELECT R.data
FROM (
-- A cunning way to use TOP in the recursive part of a CTE Smile
SELECT T.data,
rn = ROW_NUMBER() OVER (ORDER BY T.data)
FROM dbo.Test T
JOIN RecursiveCTE R
ON R.data < T.data
) R
WHERE R.rn = 1
)
SELECT *
FROM RecursiveCTE
OPTION (MAXRECURSION 0);
SET STATISTICS TIME OFF;
GO
DROP TABLE dbo.Test;
The recursive CTE is 100 times more efficient :-) This kind of speedup would be extremely valuable for my current project, but I am not sure in which cases this approach is beneficial.
To be honest: I don't get why this speeds up the query that much and why the database cannot do this optimization itself. Can you explain how this works and why it is so effective?
Edit: I see a similar effect on sybase, so this approach does not seem to be valid for sql-server only.
Sub-question: is the recursive CTE useful for other data base systems as well?
Paul White explained this "trick" in great detail in his post Performance Tuning the Whole Query Plan under Finding the Distinct Values section.
Why the database cannot do this optimization itself?
Is the recursive CTE useful for other data base systems as well?
Optimizer is not perfect and it doesn't implement all possible techniques. People asked Microsoft to implement it. See this Connect item Implement Index Skip Scan. It was closed as Won't Fix, but that doesn't mean that it won't be addressed in the future. Other DBMSes may have implemented it (Connect item says that Oracle implements this optimization). If this kind of optimization is implemented in the DBMS engine, then this "trick" is not needed and optimizer would choose optimal method of calculating the result based on available statistics.
I don't get why this speeds up the query that much.
I am not sure in which cases this approach is beneficial
The simple DISTINCT
query scans the whole index. "Scans" means that it reads each and every page of the index from the disk and aggregates the values in memory (or tempdb) to get a list of distinct values.
If you know that the table has a lot of rows, but only few different distinct values, then reading all these repeating values is a waste of time. Recursive CTE forces the server to seek an index for the first distinct value, then seek an index for the second value and so on. "Seek" means that server uses binary search in the index to find the value. Usually one seek requires reading of only few pages from disk. "Index" is a balanced tree.
If the table has only few distinct values it is faster to seek few times, than read all pages of the index. On the other hand, if there are a lot of distinct values then it would be faster to read all pages sequentially, than seeking for each consecutive value. This should give you an idea in what cases this approach is beneficial.
Obviously, if the table is small, it is faster to scan it. Only when the table becomes "large enough" you start seeing the difference in performance.
There is a related question on dba.se: Is it possible to get seek based parallel plan for distinct/group by?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With