I am using SQL Server to store data about ticket validation. A single ticket can be validated at multiple places. I need to group records by "entry" and "exit" place and calculate statistics about the duration that passes between two validations. Here is the table (simplified for clarity):
CREATE TABLE TestDuration
(VALIDATION_TIMESTAMP datetime,
ID_TICKET bigint,
ID_PLACE bigint)
And data:
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-25 19:24:05.700', 1, 1)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-25 20:08:04.250', 2, 2)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:13.040', 3, 3)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:20.990', 1, 2)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:29.290', 2, 4)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:25:37.040', 1, 4)
Here is the aggregation query:
SELECT VisitDurationCalcTable.ID_PLACE AS ID_PLACE_IN,
VisitDurationCalcTable.ID_NEXT_VISIT_PLACE AS ID_PLACE_OUT,
COUNT(visitduration) AS NUMBER_OF_VISITS, AVG(visitduration) AS AVERAGE_VISIT_DURATION
FROM (
SELECT EntryData.VALIDATION_TIMESTAMP, EntryData.ID_TICKET, EntryData.ID_PLACE,
(
SELECT TOP 1 ID_PLACE FROM TestDuration
WHERE ID_TICKET=EntryData.ID_TICKET
AND VALIDATION_TIMESTAMP>EntryData.VALIDATION_TIMESTAMP
ORDER BY VALIDATION_TIMESTAMP ASC
)
AS ID_NEXT_VISIT_PLACE,
DATEDIFF(n, EntryData.VALIDATION_TIMESTAMP,
(
SELECT TOP 1 VALIDATION_TIMESTAMP FROM TestDuration
WHERE ID_TICKET=EntryData.ID_TICKET
AND VALIDATION_TIMESTAMP>EntryData.VALIDATION_TIMESTAMP
ORDER BY VALIDATION_TIMESTAMP ASC
)
) AS visitduration
FROM TestDuration EntryData)
AS VisitDurationCalcTable
WHERE VisitDurationCalcTable.ID_NEXT_VISIT_PLACE IS NOT NULL
GROUP BY VisitDurationCalcTable.ID_PLACE, VisitDurationCalcTable.ID_NEXT_VISIT_PLACE
The query works, but I've hit a performance problem pretty fast: for 40K rows in the table, execution time is about 3 minutes. I'm no SQL guru, so I can't really see how to transform the query to make it faster. It's not a critical report and it runs only about once per month, but it nevertheless makes my app look bad. I have a feeling I'm missing something simple here.
TLDR Version
You are clearly missing an index that would help this query. Adding the missing index will likely cause an order of magnitude improvement on its own.
If you are on SQL Server 2012, rewriting the query using LEAD would also do it (though that too would benefit from the missing index).
If you are still on 2005/2008 then you can make some improvements to the existing query, but the effect will be relatively minor compared to the index change.
Longer Version
For this to take 3 minutes I assume you have no useful indexes at all, and that the biggest win would be to simply add an index (for a report run once a month, simply copying the data from the three columns into an appropriately indexed #temp table might suffice if you don't want to create a permanent index).
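As a rough sketch of that #temp table approach (the table and index names below are mine, not from the original; any index keyed on ID_TICKET, VALIDATION_TIMESTAMP will do):
SELECT VALIDATION_TIMESTAMP,
       ID_TICKET,
       ID_PLACE
INTO   #TicketValidations
FROM   TestDuration

CREATE CLUSTERED INDEX IX_TicketValidations
ON #TicketValidations (ID_TICKET, VALIDATION_TIMESTAMP)

-- Point the report query at #TicketValidations instead of TestDuration,
-- then DROP TABLE #TicketValidations once the report has run.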
You say that you simplified the table for clarity and that it has 40K rows. Assuming the following test data:
CREATE TABLE TestDuration
(
Id UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
VALIDATION_TIMESTAMP DATETIME,
ID_TICKET BIGINT,
ID_PLACE BIGINT,
OtherColumns CHAR(100) NULL
)
INSERT INTO TestDuration
(VALIDATION_TIMESTAMP,
ID_TICKET,
ID_PLACE)
SELECT TOP 40000 DATEADD(minute, ROW_NUMBER() OVER (ORDER BY (SELECT 0)), GETDATE()),
ABS(CHECKSUM(NEWID())) % 10,
ABS(CHECKSUM(NEWID())) % 100
FROM master..spt_values v1,
master..spt_values v2
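With this data ID_TICKET takes one of only 10 distinct values, so each ticket has roughly 4,000 validations; that is what makes the per-row correlated subqueries so expensive.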
Your original query takes 51 seconds on my machine at MAXDOP 1, with the following IO stats:
Table 'Worktable'. Scan count 79990, logical reads 1167573, physical reads 0
Table 'TestDuration'. Scan count 3, logical reads 2472, physical reads 0.
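(For reference, timings and IO figures like these can be captured with the standard session options below; appending OPTION (MAXDOP 1) to the query under test is one way of forcing the serial plan.)
SET STATISTICS IO ON
SET STATISTICS TIME ON

-- run the query under test here, with OPTION (MAXDOP 1) appended

SET STATISTICS IO OFF
SET STATISTICS TIME OFF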
For each of the 40,000 rows in the table it is doing two sorts of all matching ID_TICKET rows in order to identify the next one in order of VALIDATION_TIMESTAMP.
Simply adding an index as below brings the elapsed time down to 406 ms, an improvement of more than 100 times (the subsequent queries in this answer assume this index is now in place).
CREATE NONCLUSTERED INDEX IX
ON TestDuration(ID_TICKET, VALIDATION_TIMESTAMP)
INCLUDE (ID_PLACE)
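The key order matters here: with ID_TICKET first and VALIDATION_TIMESTAMP second, each "next validation" lookup becomes a single ordered seek, and the INCLUDE of ID_PLACE makes the index covering so the base table never needs to be touched.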
In the new plan the 80,000 sorts and spool operations are replaced with index seeks:
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 79991, logical reads 255707, physical reads 0
It is still doing 2 seeks for every row, however. Rewriting with CROSS APPLY allows these to be combined.
SELECT VisitDurationCalcTable.ID_PLACE AS ID_PLACE_IN,
VisitDurationCalcTable.ID_NEXT_VISIT_PLACE AS ID_PLACE_OUT,
COUNT(visitduration) AS NUMBER_OF_VISITS,
AVG(visitduration) AS AVERAGE_VISIT_DURATION
FROM (SELECT EntryData.VALIDATION_TIMESTAMP,
EntryData.ID_TICKET,
EntryData.ID_PLACE,
CA.ID_PLACE AS ID_NEXT_VISIT_PLACE,
DATEDIFF(n, EntryData.VALIDATION_TIMESTAMP, CA.VALIDATION_TIMESTAMP) AS visitduration
FROM TestDuration EntryData
CROSS APPLY (SELECT TOP 1 ID_PLACE,
VALIDATION_TIMESTAMP
FROM TestDuration
WHERE ID_TICKET = EntryData.ID_TICKET
AND VALIDATION_TIMESTAMP > EntryData.VALIDATION_TIMESTAMP
ORDER BY VALIDATION_TIMESTAMP ASC) CA) AS VisitDurationCalcTable
GROUP BY VisitDurationCalcTable.ID_PLACE,
VisitDurationCalcTable.ID_NEXT_VISIT_PLACE
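Note that CROSS APPLY (unlike OUTER APPLY) discards rows for which the inner query returns nothing, so the last validation of each ticket is filtered out automatically and the explicit IS NOT NULL check from the original query is no longer needed.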
This gives me an elapsed time of 269 ms:
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 40001, logical reads 127988, physical reads 0
Whilst the number of reads is still quite high, the seeks are all reading pages that have just been read by the scan, so they are all pages in cache. The number of reads can be reduced by using a table variable.
DECLARE @T TABLE (
VALIDATION_TIMESTAMP DATETIME,
ID_TICKET BIGINT,
ID_PLACE BIGINT,
RN INT,
PRIMARY KEY(ID_TICKET, RN) )
INSERT INTO @T
SELECT VALIDATION_TIMESTAMP,
ID_TICKET,
ID_PLACE,
ROW_NUMBER() OVER (PARTITION BY ID_TICKET ORDER BY VALIDATION_TIMESTAMP) AS RN
FROM TestDuration
SELECT T1.ID_PLACE AS ID_PLACE_IN,
T2.ID_PLACE AS ID_PLACE_OUT,
COUNT(*) AS NUMBER_OF_VISITS,
AVG(DATEDIFF(n, T1.VALIDATION_TIMESTAMP, T2.VALIDATION_TIMESTAMP)) AS AVERAGE_VISIT_DURATION
FROM @T T1
INNER MERGE JOIN @T T2
ON T1.ID_TICKET = T2.ID_TICKET
AND T2.RN = T1.RN + 1
GROUP BY T1.ID_PLACE,
T2.ID_PLACE
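The PRIMARY KEY on (ID_TICKET, RN) stores both copies of the table variable in the same order, so the hinted MERGE JOIN requires no additional sorts, and the T2.RN = T1.RN + 1 predicate pairs each validation with the next one for the same ticket.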
However, for me at least, that slightly increased the elapsed time to 301 ms (43 ms for the insert + 258 ms for the select), but it could still be a good option in lieu of creating a permanent index.
(Insert)
Table 'TestDuration'. Scan count 1, logical reads 233, physical reads 0
(Select)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table '#0C50D423'. Scan count 2, logical reads 372, physical reads 0
Finally, if you are using SQL Server 2012 you can use LEAD (SQL Fiddle):
WITH CTE
AS (SELECT ID_PLACE AS ID_PLACE_IN,
LEAD(ID_PLACE) OVER (PARTITION BY ID_TICKET
ORDER BY VALIDATION_TIMESTAMP) AS ID_PLACE_OUT,
DATEDIFF(n,
VALIDATION_TIMESTAMP,
LEAD(VALIDATION_TIMESTAMP) OVER (PARTITION BY ID_TICKET
ORDER BY VALIDATION_TIMESTAMP)) AS VISIT_DURATION
FROM TestDuration)
SELECT ID_PLACE_IN,
ID_PLACE_OUT,
COUNT(*) AS NUMBER_OF_VISITS,
AVG(VISIT_DURATION) AS AVERAGE_VISIT_DURATION
FROM CTE
WHERE ID_PLACE_OUT IS NOT NULL
GROUP BY ID_PLACE_IN,
ID_PLACE_OUT
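LEAD returns NULL for the last validation of each ticket, so the WHERE ID_PLACE_OUT IS NOT NULL filter discards those rows, just as the IS NOT NULL check did in the original query.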
That gave me an elapsed time of 249 ms:
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 1, logical reads 233, physical reads 0
The LEAD version also performs well without the index. Omitting the optimal index adds an additional SORT to the plan and means it has to read the wider clustered index on my test table, but it still completed in an elapsed time of 293 ms.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 1, logical reads 824, physical reads 0