
Calculate statistics about duration between timestamped data

Tags: sql, sql-server

I am using SQL Server to store data about ticket validation. A single ticket can be validated at multiple places. I need to group records by "entry" and "exit" place and calculate statistics about the duration that passes between two validations. Here is the table (simplified for clarity):

CREATE TABLE TestDuration
(VALIDATION_TIMESTAMP datetime, 
ID_TICKET bigint, 
ID_PLACE bigint)

And data:

INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-25 19:24:05.700', 1, 1)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-25 20:08:04.250', 2, 2)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:13.040', 3, 3)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:20.990', 1, 2)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:29.290', 2, 4)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:25:37.040', 1, 4)

Here is the aggregation query:

SELECT VisitDurationCalcTable.ID_PLACE AS ID_PLACE_IN, 
VisitDurationCalcTable.ID_NEXT_VISIT_PLACE AS ID_PLACE_OUT, 
COUNT(visitduration) AS NUMBER_OF_VISITS, AVG(visitduration) AS AVERAGE_VISIT_DURATION 
FROM (
      SELECT EntryData.VALIDATION_TIMESTAMP, EntryData.ID_TICKET, EntryData.ID_PLACE, 
      (
       SELECT TOP 1 ID_PLACE FROM TestDuration 
          WHERE ID_TICKET=EntryData.ID_TICKET 
          AND VALIDATION_TIMESTAMP>EntryData.VALIDATION_TIMESTAMP 
          ORDER BY VALIDATION_TIMESTAMP ASC
      ) 
      AS ID_NEXT_VISIT_PLACE, 
      DATEDIFF(n,EntryData.VALIDATION_TIMESTAMP,
               (
                SELECT TOP 1 VALIDATION_TIMESTAMP FROM TestDuration WHERE ID_TICKET=EntryData.ID_TICKET and VALIDATION_TIMESTAMP>EntryData.VALIDATION_TIMESTAMP ORDER BY VALIDATION_TIMESTAMP ASC
               )
              ) AS visitduration 
     FROM TestDuration EntryData)
AS VisitDurationCalcTable 
WHERE VisitDurationCalcTable.ID_NEXT_VISIT_PLACE IS NOT NULL
GROUP BY VisitDurationCalcTable.ID_PLACE, VisitDurationCalcTable.ID_NEXT_VISIT_PLACE

The query works, but I've hit a performance problem pretty quickly. For 40K rows in the table, query execution time is about 3 minutes. I'm no SQL guru, so I can't really see how to transform the query to run faster. It's not a critical report and it is run only about once per month, but it nevertheless makes my app look bad. I have a feeling I'm missing something simple here.
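For reference, working the sample data above through by hand (ticket 3 has only one validation, so it contributes no pair), the query should produce something like the following — these figures are hand-computed from the six INSERTs, not from the original post:

```sql
-- Pairs formed from the sample data, with DATEDIFF(n, ...) counting
-- minute-boundary crossings:
--   ticket 1: place 1 -> 2 (894 min), place 2 -> 4 (7 min)
--   ticket 2: place 2 -> 4 (850 min)
--   ticket 3: single validation, no pair
--
-- ID_PLACE_IN  ID_PLACE_OUT  NUMBER_OF_VISITS  AVERAGE_VISIT_DURATION
-- 1            2             1                 894
-- 2            4             2                 428  -- integer AVG of 7 and 850
```

Note that AVG over an integer column uses integer arithmetic, so (7 + 850) / 2 comes out as 428 rather than 428.5.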

asked Dec 18 '12 by Janeks Bergs


1 Answer

TLDR Version

You are clearly missing an index that would help this query. Adding the missing index will likely cause an order of magnitude improvement on its own.

If you are on SQL Server 2012, rewriting the query using LEAD would also do it (though that too would benefit from the missing index).

If you are still on 2005/2008, you can make some improvements to the existing query, but the effect will be relatively minor compared to the index change.

Longer Version

For this to take 3 minutes, I assume you have no useful indexes at all, and that the biggest win would be to simply add an index (for a report run once a month, simply copying the data from the three columns into an appropriately indexed #temp table might suffice if you don't want to create a permanent index).
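A sketch of that #temp table variant (the table name #Validations is illustrative, not from the question):

```sql
-- Copy only the three relevant columns into a #temp table,
-- index it to support the "next validation per ticket" lookup,
-- then point the report query at #Validations instead of TestDuration.
SELECT VALIDATION_TIMESTAMP,
       ID_TICKET,
       ID_PLACE
INTO   #Validations
FROM   TestDuration;

CREATE CLUSTERED INDEX IX_Validations
  ON #Validations (ID_TICKET, VALIDATION_TIMESTAMP);

-- ... run the report against #Validations ...

DROP TABLE #Validations;
```

The #temp table is dropped automatically when the session ends, so nothing permanent is left behind.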

You say that you simplified the table for clarity and that it has 40K rows. Assuming the following test data:

CREATE TABLE TestDuration
  (
     Id                   UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
     VALIDATION_TIMESTAMP DATETIME,
     ID_TICKET            BIGINT,
     ID_PLACE             BIGINT,
     OtherColumns         CHAR(100) NULL
  )

INSERT INTO TestDuration
            (VALIDATION_TIMESTAMP,
             ID_TICKET,
             ID_PLACE)
SELECT TOP 40000 DATEADD(minute, ROW_NUMBER() OVER (ORDER BY (SELECT 0)), GETDATE()),
                 ABS(CHECKSUM(NEWID())) % 10,
                 ABS(CHECKSUM(NEWID())) % 100
FROM   master..spt_values v1,
       master..spt_values v2 

Your original query takes 51 seconds on my machine at MAXDOP 1, with the following IO stats:

Table 'Worktable'. Scan count 79990, logical reads 1167573, physical reads 0
Table 'TestDuration'. Scan count 3, logical reads 2472, physical reads 0.
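The elapsed-time and IO figures quoted in this answer can be reproduced with something like the following (a sketch; the statistics output appears on the Messages tab in SSMS):

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- place the query under test here; a trivial stand-in is shown
SELECT COUNT(*)
FROM   TestDuration
OPTION (MAXDOP 1);   -- force a serial plan so timings are comparable

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```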

Plan 1

For each of the 40,000 rows in the table, it is doing two sorts of all matching ID_TICKET rows in order to identify the next one in order of VALIDATION_TIMESTAMP.

Simply adding an index as below brings the elapsed time down to 406 ms, an improvement of more than 100 times (the subsequent queries in this answer assume this index is now in place):

CREATE NONCLUSTERED INDEX IX
  ON TestDuration(ID_TICKET, VALIDATION_TIMESTAMP)
  INCLUDE (ID_PLACE) 

The plan now looks as follows, with the 80,000 sorts and spool operations replaced with index seeks:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 79991, logical reads 255707, physical reads 0

Plan 2

It is still doing two seeks for every row, however. Rewriting with CROSS APPLY allows these to be combined:

SELECT VisitDurationCalcTable.ID_PLACE            AS ID_PLACE_IN,
       VisitDurationCalcTable.ID_NEXT_VISIT_PLACE AS ID_PLACE_OUT,
       COUNT(visitduration)                       AS NUMBER_OF_VISITS,
       AVG(visitduration)                         AS AVERAGE_VISIT_DURATION
FROM   (SELECT EntryData.VALIDATION_TIMESTAMP,
               EntryData.ID_TICKET,
               EntryData.ID_PLACE,
               CA.ID_PLACE                                                          AS ID_NEXT_VISIT_PLACE,
               DATEDIFF(n, EntryData.VALIDATION_TIMESTAMP, CA.VALIDATION_TIMESTAMP) AS visitduration
        FROM   TestDuration EntryData
               CROSS APPLY (SELECT TOP 1 ID_PLACE,
                                         VALIDATION_TIMESTAMP
                            FROM   TestDuration
                            WHERE  ID_TICKET = EntryData.ID_TICKET
                                   AND VALIDATION_TIMESTAMP > EntryData.VALIDATION_TIMESTAMP
                            ORDER  BY VALIDATION_TIMESTAMP ASC) CA) AS VisitDurationCalcTable
GROUP  BY VisitDurationCalcTable.ID_PLACE,
          VisitDurationCalcTable.ID_NEXT_VISIT_PLACE 

This gives me an elapsed time of 269 ms:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 40001, logical reads 127988, physical reads 0

Plan 3

Whilst the number of reads is still quite high, the seeks are all reading pages that have just been read by the scan, so they are all pages in cache. The number of reads can be reduced by using a table variable:

DECLARE @T TABLE (
  VALIDATION_TIMESTAMP DATETIME,
  ID_TICKET            BIGINT,
  ID_PLACE             BIGINT,
  RN                   INT,
  PRIMARY KEY (ID_TICKET, RN) )

INSERT INTO @T
SELECT VALIDATION_TIMESTAMP,
       ID_TICKET,
       ID_PLACE,
       ROW_NUMBER() OVER (PARTITION BY ID_TICKET ORDER BY VALIDATION_TIMESTAMP) AS RN
FROM   TestDuration

SELECT T1.ID_PLACE                                                        AS ID_PLACE_IN,
       T2.ID_PLACE                                                        AS ID_PLACE_OUT,
       COUNT(*)                                                           AS NUMBER_OF_VISITS,
       AVG(DATEDIFF(n, T1.VALIDATION_TIMESTAMP, T2.VALIDATION_TIMESTAMP)) AS AVERAGE_VISIT_DURATION
FROM   @T T1
       INNER MERGE JOIN @T T2
         ON T1.ID_TICKET = T2.ID_TICKET
            AND T2.RN = T1.RN + 1
GROUP  BY T1.ID_PLACE,
          T2.ID_PLACE 

However, for me at least, that slightly increased the elapsed time to 301 ms (43 ms for the insert + 258 ms for the select), but it could still be a good option in lieu of creating a permanent index.

(Insert)
Table 'TestDuration'. Scan count 1, logical reads 233, physical reads 0

(Select)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table '#0C50D423'. Scan count 2, logical reads 372, physical reads 0

Plan

Finally, if you are using SQL Server 2012, you can use LEAD (SQL Fiddle):

WITH CTE
     AS (SELECT ID_PLACE AS ID_PLACE_IN,
                LEAD(ID_PLACE) OVER (PARTITION BY ID_TICKET 
                                         ORDER BY VALIDATION_TIMESTAMP) AS ID_PLACE_OUT,
                DATEDIFF(n, 
                         VALIDATION_TIMESTAMP, 
                         LEAD(VALIDATION_TIMESTAMP) OVER (PARTITION BY ID_TICKET 
                                                              ORDER BY VALIDATION_TIMESTAMP)) AS VISIT_DURATION
         FROM   TestDuration)
SELECT ID_PLACE_IN,
       ID_PLACE_OUT,
       COUNT(*)            AS NUMBER_OF_VISITS,
       AVG(VISIT_DURATION) AS AVERAGE_VISIT_DURATION
FROM   CTE
WHERE  ID_PLACE_OUT IS NOT NULL
GROUP  BY ID_PLACE_IN,
          ID_PLACE_OUT 

That gave me an elapsed time of 249 ms:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 1, logical reads 233, physical reads 0

Plan 4

The LEAD version also performs well without the index. Omitting the optimal index adds an additional sort to the plan and means it has to read the wider clustered index on my test table, but it still completed in an elapsed time of 293 ms:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 1, logical reads 824, physical reads 0
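One caveat worth noting (an aside, not part of the original answer): DATEDIFF(n, ...) counts minute-boundary crossings, so sub-minute visits come out as 0 or 1, and AVG over an integer column truncates. If finer-grained averages matter, the LEAD query can be varied to work in seconds and return a decimal average in minutes:

```sql
WITH CTE AS
 (SELECT ID_PLACE AS ID_PLACE_IN,
         LEAD(ID_PLACE) OVER (PARTITION BY ID_TICKET
                                  ORDER BY VALIDATION_TIMESTAMP) AS ID_PLACE_OUT,
         -- measure in seconds instead of minute-boundary crossings
         DATEDIFF(second,
                  VALIDATION_TIMESTAMP,
                  LEAD(VALIDATION_TIMESTAMP) OVER (PARTITION BY ID_TICKET
                                                       ORDER BY VALIDATION_TIMESTAMP)) AS VISIT_DURATION_S
  FROM   TestDuration)
SELECT ID_PLACE_IN,
       ID_PLACE_OUT,
       COUNT(*)                              AS NUMBER_OF_VISITS,
       -- multiply by 1.0 to avoid integer AVG truncation
       AVG(1.0 * VISIT_DURATION_S) / 60.0    AS AVERAGE_VISIT_DURATION_MINUTES
FROM   CTE
WHERE  ID_PLACE_OUT IS NOT NULL
GROUP  BY ID_PLACE_IN,
          ID_PLACE_OUT;
```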
answered Sep 28 '22 by Martin Smith