I have a performance question about Common table expressions in SQL Server. In our developer team we use a lot of chaining CTEs when building our queries. I am currently working on a query which had terrible performance. But I found out that if I in the middle of the chain inserted all the records up to that CTE in a temporary table instead and then continued but selecting from that temp table I improved performance significantly. Now I would like to get some help to understand if this type of change only applies to this specific query and why the two cases you will see below differ so much in performance. Or could we possibly overuse CTEs in our team and can we gain performance generally by learning from this case?
Please try to explain to me exactly what is happening here...
The code is complete and you will be able to run it on SQL Server 2008 and probably 2005 too. One part is commented out and my idea is that you can switch the two cases by comment out one or the other. You can see where to put your block comments, I have marked these places with --block comment here
and --end block comment here
It is the slow performing case that is uncommented default. Here you are:
--Declare tables to use in example.
CREATE TABLE #Preparation
(
Date DATETIME NOT NULL
,Hour INT NOT NULL
,Sales NUMERIC(9,2)
,Items INT
);
CREATE TABLE #Calendar
(
Date DATETIME NOT NULL
)
CREATE TABLE #OpenHours
(
Day INT NOT NULL,
OpenFrom TIME NOT NULL,
OpenTo TIME NOT NULL
);
--Fill tables with sample data.
INSERT INTO #OpenHours (Day, OpenFrom, OpenTo)
VALUES
(1, '10:00', '20:00'),
(2, '10:00', '20:00'),
(3, '10:00', '20:00'),
(4, '10:00', '20:00'),
(5, '10:00', '20:00'),
(6, '10:00', '20:00'),
(7, '10:00', '20:00')
DECLARE @CounterDay INT = 0, @CounterHour INT = 0, @Sales NUMERIC(9, 2), @Items INT;
WHILE @CounterDay < 365
BEGIN
SET @CounterHour = 0;
WHILE @CounterHour < 5
BEGIN
SET @Items = CAST(RAND() * 100 AS INT);
SET @Sales = CAST(RAND() * 1000 AS NUMERIC(9, 2));
IF @Items % 2 = 0
BEGIN
SET @Items = NULL;
SET @Sales = NULL;
END
INSERT INTO #Preparation (Date, Hour, Items, Sales)
VALUES (DATEADD(DAY, @CounterDay, '2011-01-01'), @CounterHour + 13, @Items, @Sales);
SET @CounterHour += 1;
END
INSERT INTO #Calendar (Date) VALUES (DATEADD(DAY, @CounterDay, '2011-01-01'));
SET @CounterDay += 1;
END
--Here the query starts.
;WITH P AS (
SELECT DATEADD(HOUR, Hour, Date) AS Hour
,Sales
,Items
FROM #Preparation
),
O AS (
SELECT DISTINCT DATEADD(HOUR, SV.number, C.Date) AS Hour
FROM #OpenHours AS O
JOIN #Calendar AS C ON O.Day = DATEPART(WEEKDAY, C.Date)
JOIN master.dbo.spt_values AS SV ON SV.number BETWEEN DATEPART(HOUR, O.OpenFrom) AND DATEPART(HOUR, O.OpenTo)
),
S AS (
SELECT O.Hour, P.Sales, P.Items
FROM O
LEFT JOIN P ON P.Hour = O.Hour
)
--block comment here case 1 (slow performing)
--With this technique it takes about 34 seconds.
,N AS (
SELECT
A.Hour
,A.Sales AS SalesOrg
,CASE WHEN COALESCE(B.Sales, C.Sales, 1) < 0
THEN 0 ELSE COALESCE(B.Sales, C.Sales, 1) END AS Sales
,A.Items AS ItemsOrg
,COALESCE(B.Items, C.Items, 1) AS Items
FROM S AS A
OUTER APPLY (SELECT TOP 1 *
FROM S
WHERE Hour <= A.Hour
AND Sales IS NOT NULL
AND DATEDIFF(DAY, Hour, A.Hour) = 0
ORDER BY Hour DESC) B
OUTER APPLY (SELECT TOP 1 *
FROM S
WHERE Sales IS NOT NULL
AND DATEDIFF(DAY, Hour, A.Hour) = 0
ORDER BY Hour) C
)
--end block comment here case 1 (slow performing)
/*--block comment here case 2 (fast performing)
--With this technique it takes about 2 seconds.
SELECT * INTO #tmpS FROM S;
WITH
N AS (
SELECT
A.Hour
,A.Sales AS SalesOrg
,CASE WHEN COALESCE(B.Sales, C.Sales, 1) < 0
THEN 0 ELSE COALESCE(B.Sales, C.Sales, 1) END AS Sales
,A.Items AS ItemsOrg
,COALESCE(B.Items, C.Items, 1) AS Items
FROM #tmpS AS A
OUTER APPLY (SELECT TOP 1 *
FROM #tmpS
WHERE Hour <= A.Hour
AND Sales IS NOT NULL
AND DATEDIFF(DAY, Hour, A.Hour) = 0
ORDER BY Hour DESC) B
OUTER APPLY (SELECT TOP 1 *
FROM #tmpS
WHERE Sales IS NOT NULL
AND DATEDIFF(DAY, Hour, A.Hour) = 0
ORDER BY Hour) C
)
--end block comment here case 2 (fast performing)*/
SELECT * FROM N ORDER BY Hour
IF OBJECT_ID('tempdb..#tmpS') IS NOT NULL DROP TABLE #tmpS;
DROP TABLE #Preparation;
DROP TABLE #Calendar;
DROP TABLE #OpenHours;
If you would like to try and understand what I am doing in the last step I have a SO question about it here.
For me case 1 takes about 34 seconds and case 2 takes about 2 seconds. The difference is that I store the result from S in a temp table in case 2, in case 1 I use S in my next CTE directly.
This can result in performance gains. Also, if you have a complicated CTE (subquery) that is used more than once, then storing it in a temporary table will often give a performance boost.
Looking at SQL Profiler results from these queries (each were run 10 times and averages are below) we can see that the CTE just slightly outperforms both the temporary table and table variable queries when it comes to overall duration.
Disadvantages of CTECTE's members cannot use the following clauses of keywords Distinct, Group By, Having, Top, Joins limiting by this type of the queries that can be created and reducing their complexity. The Recursive member can refer to the CTE only once.
A CTE
is essentially just a disposable view. It will pretty much never make a query any faster than just putting the CTE
code into a FROM
clause as a table expression.
In your example, the real issue is the date functions I believe.
Your first (slow) case requires the date functions to be run for every row.
For your second (faster) case they are run once and stored into a table.
This is not normally so noticeable unless you do some sort of logic on the function-derived field. In your case you are doing an ORDER BY
on Hour
, which is very costly. In your second example it's a simple sort on a field, but in the first you are running that function for each row, THEN sorting.
For a lot more in-depth reading on CTEs, see this question on DBA.SE.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With