
Calculate statistics about duration between timestamped data

Tags: sql, sql-server

I am using SQL Server to store data about ticket validation. A single ticket can be validated at multiple places. I need to group records by "entry" and "exit" place and calculate statistics about the duration that passes between two validations. Here is the table (simplified for clarity):

CREATE TABLE TestDuration
(VALIDATION_TIMESTAMP datetime, 
ID_TICKET bigint, 
ID_PLACE bigint)

And data:

INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-25 19:24:05.700', 1, 1)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-25 20:08:04.250', 2, 2)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:13.040', 3, 3)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:20.990', 1, 2)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:29.290', 2, 4)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:25:37.040', 1, 4)

Here is the aggregation query:

SELECT VisitDurationCalcTable.ID_PLACE AS ID_PLACE_IN, 
VisitDurationCalcTable.ID_NEXT_VISIT_PLACE AS ID_PLACE_OUT, 
COUNT(visitduration) AS NUMBER_OF_VISITS, AVG(visitduration) AS AVERAGE_VISIT_DURATION 
FROM (
      SELECT EntryData.VALIDATION_TIMESTAMP, EntryData.ID_TICKET, EntryData.ID_PLACE, 
      (
       SELECT TOP 1 ID_PLACE FROM TestDuration 
          WHERE ID_TICKET=EntryData.ID_TICKET 
          AND VALIDATION_TIMESTAMP>EntryData.VALIDATION_TIMESTAMP 
          ORDER BY VALIDATION_TIMESTAMP ASC
      ) 
      AS ID_NEXT_VISIT_PLACE, 
      DATEDIFF(n,EntryData.VALIDATION_TIMESTAMP,
               (
                SELECT TOP 1 VALIDATION_TIMESTAMP FROM TestDuration WHERE ID_TICKET=EntryData.ID_TICKET and VALIDATION_TIMESTAMP>EntryData.VALIDATION_TIMESTAMP ORDER BY VALIDATION_TIMESTAMP ASC
               )
              ) AS visitduration 
     FROM TestDuration EntryData)
AS VisitDurationCalcTable 
WHERE VisitDurationCalcTable.ID_NEXT_VISIT_PLACE IS NOT NULL
GROUP BY VisitDurationCalcTable.ID_PLACE, VisitDurationCalcTable.ID_NEXT_VISIT_PLACE

The query works, but I've hit a performance problem pretty quickly. For 40K rows in the table, query execution time is about 3 minutes. I'm no SQL guru, so I can't really see how to transform the query to run faster. It's not a critical report and it is run only about once per month, but it nevertheless makes my app look bad. I have a feeling I'm missing something simple here.
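For reference, working the sample data above through by hand (ticket 3 has only one validation, so it contributes no pair), the query should produce something like the following — these figures are hand-computed from the six INSERTs, not from the original post:

```sql
-- Pairs formed from the sample data, with DATEDIFF(n, ...) counting
-- minute-boundary crossings:
--   ticket 1: place 1 -> 2 (894 min), place 2 -> 4 (7 min)
--   ticket 2: place 2 -> 4 (850 min)
--   ticket 3: single validation, no pair
--
-- ID_PLACE_IN  ID_PLACE_OUT  NUMBER_OF_VISITS  AVERAGE_VISIT_DURATION
-- 1            2             1                 894
-- 2            4             2                 428  -- integer AVG of 7 and 850
```

Note that AVG over an integer column uses integer arithmetic, so (7 + 850) / 2 comes out as 428 rather than 428.5.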

asked Dec 18 '12 by Janeks Bergs


1 Answer

TLDR Version

You are clearly missing an index that would help this query. Adding the missing index will likely cause an order of magnitude improvement on its own.

If you are on SQL Server 2012, rewriting the query using LEAD would also do it (though that too would benefit from the missing index).

If you are still on 2005/2008, you can make some improvements to the existing query, but the effect will be relatively minor compared to the index change.

Longer Version

For this to take 3 minutes, I assume you have no useful indexes at all, and that the biggest win would be to simply add an index (for a report run once a month, simply copying the data from the three columns into an appropriately indexed #temp table might suffice if you don't want to create a permanent index).
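A sketch of that #temp table variant (the table name #Validations is illustrative, not from the question):

```sql
-- Copy only the three relevant columns into a #temp table,
-- index it to support the "next validation per ticket" lookup,
-- then point the report query at #Validations instead of TestDuration.
SELECT VALIDATION_TIMESTAMP,
       ID_TICKET,
       ID_PLACE
INTO   #Validations
FROM   TestDuration;

CREATE CLUSTERED INDEX IX_Validations
  ON #Validations (ID_TICKET, VALIDATION_TIMESTAMP);

-- ... run the report against #Validations ...

DROP TABLE #Validations;
```

The #temp table is dropped automatically when the session ends, so nothing permanent is left behind.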

You say that you simplified the table for clarity and that it has 40K rows. Assuming the following test data:

CREATE TABLE TestDuration
  (
     Id                   UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
     VALIDATION_TIMESTAMP DATETIME,
     ID_TICKET            BIGINT,
     ID_PLACE             BIGINT,
     OtherColumns         CHAR(100) NULL
  )

INSERT INTO TestDuration
            (VALIDATION_TIMESTAMP,
             ID_TICKET,
             ID_PLACE)
SELECT TOP 40000 DATEADD(minute, ROW_NUMBER() OVER (ORDER BY (SELECT 0)), GETDATE()),
                 ABS(CHECKSUM(NEWID())) % 10,
                 ABS(CHECKSUM(NEWID())) % 100
FROM   master..spt_values v1,
       master..spt_values v2 

Your original query takes 51 seconds on my machine at MAXDOP 1, with the following IO stats:

Table 'Worktable'. Scan count 79990, logical reads 1167573, physical reads 0
Table 'TestDuration'. Scan count 3, logical reads 2472, physical reads 0.
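The elapsed-time and IO figures quoted in this answer can be reproduced with something like the following (a sketch; the statistics output appears on the Messages tab in SSMS):

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- place the query under test here; a trivial stand-in is shown
SELECT COUNT(*)
FROM   TestDuration
OPTION (MAXDOP 1);   -- force a serial plan so timings are comparable

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```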

Plan 1

For each of the 40,000 rows in the table, it is doing two sorts of all matching ID_TICKET rows in order to identify the next one in order of VALIDATION_TIMESTAMP.

Simply adding an index as below brings the elapsed time down to 406 ms, an improvement of more than 100 times (the subsequent queries in this answer assume this index is now in place):

CREATE NONCLUSTERED INDEX IX
  ON TestDuration(ID_TICKET, VALIDATION_TIMESTAMP)
  INCLUDE (ID_PLACE) 

The plan now looks as follows, with the 80,000 sorts and spool operations replaced with index seeks:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 79991, logical reads 255707, physical reads 0

Plan 2

It is still doing two seeks for every row, however. Rewriting with CROSS APPLY allows these to be combined:

SELECT VisitDurationCalcTable.ID_PLACE            AS ID_PLACE_IN,
       VisitDurationCalcTable.ID_NEXT_VISIT_PLACE AS ID_PLACE_OUT,
       COUNT(visitduration)                       AS NUMBER_OF_VISITS,
       AVG(visitduration)                         AS AVERAGE_VISIT_DURATION
FROM   (SELECT EntryData.VALIDATION_TIMESTAMP,
               EntryData.ID_TICKET,
               EntryData.ID_PLACE,
               CA.ID_PLACE                                                          AS ID_NEXT_VISIT_PLACE,
               DATEDIFF(n, EntryData.VALIDATION_TIMESTAMP, CA.VALIDATION_TIMESTAMP) AS visitduration
        FROM   TestDuration EntryData
               CROSS APPLY (SELECT TOP 1 ID_PLACE,
                                         VALIDATION_TIMESTAMP
                            FROM   TestDuration
                            WHERE  ID_TICKET = EntryData.ID_TICKET
                                   AND VALIDATION_TIMESTAMP > EntryData.VALIDATION_TIMESTAMP
                            ORDER  BY VALIDATION_TIMESTAMP ASC) CA) AS VisitDurationCalcTable
GROUP  BY VisitDurationCalcTable.ID_PLACE,
          VisitDurationCalcTable.ID_NEXT_VISIT_PLACE 

This gives me an elapsed time of 269 ms:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 40001, logical reads 127988, physical reads 0

Plan 3

Whilst the number of reads is still quite high, the seeks are all reading pages that have just been read by the scan, so they are all pages in cache. The number of reads can be reduced by using a table variable:

DECLARE @T TABLE (
  VALIDATION_TIMESTAMP DATETIME,
  ID_TICKET            BIGINT,
  ID_PLACE             BIGINT,
  RN                   INT,
  PRIMARY KEY (ID_TICKET, RN) )

INSERT INTO @T
SELECT VALIDATION_TIMESTAMP,
       ID_TICKET,
       ID_PLACE,
       ROW_NUMBER() OVER (PARTITION BY ID_TICKET ORDER BY VALIDATION_TIMESTAMP) AS RN
FROM   TestDuration

SELECT T1.ID_PLACE                                                        AS ID_PLACE_IN,
       T2.ID_PLACE                                                        AS ID_PLACE_OUT,
       COUNT(*)                                                           AS NUMBER_OF_VISITS,
       AVG(DATEDIFF(n, T1.VALIDATION_TIMESTAMP, T2.VALIDATION_TIMESTAMP)) AS AVERAGE_VISIT_DURATION
FROM   @T T1
       INNER MERGE JOIN @T T2
         ON T1.ID_TICKET = T2.ID_TICKET
            AND T2.RN = T1.RN + 1
GROUP  BY T1.ID_PLACE,
          T2.ID_PLACE 

However, for me at least, that slightly increased the elapsed time to 301 ms (43 ms for the insert + 258 ms for the select), but it could still be a good option in lieu of creating a permanent index.

(Insert)
Table 'TestDuration'. Scan count 1, logical reads 233, physical reads 0

(Select)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table '#0C50D423'. Scan count 2, logical reads 372, physical reads 0

Plan

Finally, if you are using SQL Server 2012, you can use LEAD (SQL Fiddle):

WITH CTE
     AS (SELECT ID_PLACE AS ID_PLACE_IN,
                LEAD(ID_PLACE) OVER (PARTITION BY ID_TICKET 
                                         ORDER BY VALIDATION_TIMESTAMP) AS ID_PLACE_OUT,
                DATEDIFF(n, 
                         VALIDATION_TIMESTAMP, 
                         LEAD(VALIDATION_TIMESTAMP) OVER (PARTITION BY ID_TICKET 
                                                              ORDER BY VALIDATION_TIMESTAMP)) AS VISIT_DURATION
         FROM   TestDuration)
SELECT ID_PLACE_IN,
       ID_PLACE_OUT,
       COUNT(*)            AS NUMBER_OF_VISITS,
       AVG(VISIT_DURATION) AS AVERAGE_VISIT_DURATION
FROM   CTE
WHERE  ID_PLACE_OUT IS NOT NULL
GROUP  BY ID_PLACE_IN,
          ID_PLACE_OUT 

That gave me an elapsed time of 249 ms:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 1, logical reads 233, physical reads 0

Plan 4

The LEAD version also performs well without the index. Omitting the optimal index adds an additional sort to the plan and means it has to read the wider clustered index on my test table, but it still completed in an elapsed time of 293 ms:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 1, logical reads 824, physical reads 0
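One caveat worth noting (an aside, not part of the original answer): DATEDIFF(n, ...) counts minute-boundary crossings, so sub-minute visits come out as 0 or 1, and AVG over an integer column truncates. If finer-grained averages matter, the LEAD query can be varied to work in seconds and return a decimal average in minutes:

```sql
WITH CTE AS
 (SELECT ID_PLACE AS ID_PLACE_IN,
         LEAD(ID_PLACE) OVER (PARTITION BY ID_TICKET
                                  ORDER BY VALIDATION_TIMESTAMP) AS ID_PLACE_OUT,
         -- measure in seconds instead of minute-boundary crossings
         DATEDIFF(second,
                  VALIDATION_TIMESTAMP,
                  LEAD(VALIDATION_TIMESTAMP) OVER (PARTITION BY ID_TICKET
                                                       ORDER BY VALIDATION_TIMESTAMP)) AS VISIT_DURATION_S
  FROM   TestDuration)
SELECT ID_PLACE_IN,
       ID_PLACE_OUT,
       COUNT(*)                              AS NUMBER_OF_VISITS,
       -- multiply by 1.0 to avoid integer AVG truncation
       AVG(1.0 * VISIT_DURATION_S) / 60.0    AS AVERAGE_VISIT_DURATION_MINUTES
FROM   CTE
WHERE  ID_PLACE_OUT IS NOT NULL
GROUP  BY ID_PLACE_IN,
          ID_PLACE_OUT;
```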
answered Sep 28 '22 by Martin Smith