Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does a recursive correlated expression speed up distinct queries?

I found this post about speeding up distinct queries:

Super-fast DISTINCT using a recursive CTE:

USE     tempdb;
GO
DROP    TABLE dbo.Test;
GO
CREATE  TABLE 
        dbo.Test 
        (
        data            INTEGER NOT NULL,
        );
GO
CREATE  CLUSTERED INDEX c ON dbo.Test (data);
GO
-- Lots of duplicated values
INSERT  dbo.Test WITH (TABLOCK)
        (data)
SELECT  TOP (5000000)
        ROW_NUMBER() OVER (ORDER BY (SELECT 0)) / 117329
FROM    master.sys.columns C1,
        master.sys.columns C2,
        master.sys.columns C3;
GO



SET     STATISTICS TIME ON;

-- 1591ms CPU
SELECT  DISTINCT 
        data
FROM    dbo.Test;

-- 15ms CPU

WITH    RecursiveCTE
AS      (
        SELECT  data = MIN(T.data)
        FROM    dbo.Test T
        UNION   ALL
        SELECT  R.data
        FROM    (
                -- A cunning way to use TOP in the recursive part of a CTE Smile
                SELECT  T.data,
                        rn = ROW_NUMBER() OVER (ORDER BY T.data)
                FROM    dbo.Test T
                JOIN    RecursiveCTE R
                        ON  R.data < T.data
                ) R
        WHERE   R.rn = 1
        )
SELECT  *
FROM    RecursiveCTE
OPTION  (MAXRECURSION 0);

SET     STATISTICS TIME OFF;
GO
DROP    TABLE dbo.Test;

The recursive CTE is 100 times more efficient :-) This kind of speedup would be extremely valuable for my current project, but I am not sure in which cases this approach is beneficial.

To be honest: I don't get why this speeds up the query that much and why the database cannot do this optimization itself. Can you explain how this works and why it is so effective?


Edit: I see a similar effect on sybase, so this approach does not seem to be valid for sql-server only.

Sub-question: is the recursive CTE useful for other data base systems as well?

like image 920
Beginner Avatar asked Mar 14 '17 09:03

Beginner


1 Answers

Paul White explained this "trick" in great detail in his post Performance Tuning the Whole Query Plan under Finding the Distinct Values section.

Why the database cannot do this optimization itself?

Is the recursive CTE useful for other data base systems as well?

Optimizer is not perfect and it doesn't implement all possible techniques. People asked Microsoft to implement it. See this Connect item Implement Index Skip Scan. It was closed as Won't Fix, but that doesn't mean that it won't be addressed in the future. Other DBMSes may have implemented it (Connect item says that Oracle implements this optimization). If this kind of optimization is implemented in the DBMS engine, then this "trick" is not needed and optimizer would choose optimal method of calculating the result based on available statistics.

I don't get why this speeds up the query that much.

I am not sure in which cases this approach is beneficial

The simple DISTINCT query scans the whole index. "Scans" means that it reads each and every page of the index from the disk and aggregates the values in memory (or tempdb) to get a list of distinct values.

If you know that the table has a lot of rows, but only few different distinct values, then reading all these repeating values is a waste of time. Recursive CTE forces the server to seek an index for the first distinct value, then seek an index for the second value and so on. "Seek" means that server uses binary search in the index to find the value. Usually one seek requires reading of only few pages from disk. "Index" is a balanced tree.

If the table has only few distinct values it is faster to seek few times, than read all pages of the index. On the other hand, if there are a lot of distinct values then it would be faster to read all pages sequentially, than seeking for each consecutive value. This should give you an idea in what cases this approach is beneficial.

Obviously, if the table is small, it is faster to scan it. Only when the table becomes "large enough" you start seeing the difference in performance.


There is a related question on dba.se: Is it possible to get seek based parallel plan for distinct/group by?

like image 151
Vladimir Baranov Avatar answered Sep 18 '22 19:09

Vladimir Baranov