Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trying to optimize a query that picks an 'approximate closest record'

I have a table that contains a lot of data, where we particularly care about the date field. The reason for this is that the data volume just went up ~30x, and the old ways will fall apart soon. The query I hope you can help me optimize needs to:

  • take a list of dates (generated by a cte based table-valued function)
  • retrieve a single record for each of those dates
    • based on some definition of 'nearest'

For instance, the current table contains data in 5 second (+/- a little) intervals. I need to sample that table and get the record that falls closest to a 30 second interval.

What I have right now works just fine. I'm simply curious if there is a way to optimize it more. If I can do it in Linq To SQL, that would be neat too. I am even interested in suggestions on indexes, given the amount of date values (~2 million rows min).

declare @st  datetime ; set @st  = '2012-01-31 05:05:00';
declare @end datetime ; set @end = '2012-01-31 05:10:00';

select distinct
    log.*   -- id, 
from 
    dbo.fn_GenerateDateSteps(@st, @end, 30) as d
        inner join lotsOfLogData log on l.Id = (
            select top 1 e.[Id]
            from 
                lotsOfLogData as log  -- contains data in 5 second intervals
            where
                log.stationId = 1000 
                -- search for dates in a certain range
                AND utcTime between DateAdd(s, -10, dt) AND DateAdd(s, 5, dt)
            order by
                -- get the 'closest'. this can change a little, but will always 
                -- be based on a difference between the date
                abs(datediff(s, dt, UtcTime)) 
        )
    -- updated the query to be correct. stadionId should be inside the subquery

The table structure of lotsOfLogData is below. There are relatively few station IDs (maybe 50), but lots of records for each. We know the station id when we query.

create table ##lotsOfLogData (
    Id          bigint      identity(1,1) not null
,   StationId   int         not null
,   UtcTime     datetime    not null
    -- 20 other fields, used for other calculations
)

fn_GenerateDateSteps returns a dataset like this, for the parameters given:

[DT]
2012-01-31 05:05:00.000
2012-01-31 05:05:30.000
2012-01-31 05:06:00.000
2012-01-31 05:06:30.000  (and so on, every 30 seconds)

I have done this with a temporary table as well, in this manner, but that came out just a little bit more expensive.

declare @dates table ( dt datetime, ClosestId bigint); 
insert into @dates (dt) select dt from dbo.fn_GenerateDateSteps(@st, @end, 30)
update @dates set closestId = ( -- same subquery as above )
select * from lotsOfLogData inner join @dates on Id = ClosestId

Edit: Fixed up

Got 200K+ rows to work with now. I tried both ways, and the cross apply with an appropriate index (id/time + includes(..all columns...) worked just fine. However, I ended up with the query I started, using a simpler (and existing) index on [id+time]. The more widely understandable query is why I settled on that one. Maybe there still is a better way to do it, but I can't see it :D

-- subtree cost (crossapply) : .0808
-- subtree cost (id based)   : .0797

-- see above query for what i ended up with
like image 611
Andrew Avatar asked Nov 04 '22 05:11

Andrew


1 Answers

You could try

  • changing the inner join to a cross apply.
  • Move the where log.stationid to the subselect.

SQL Statement

SELECT  DISTINCT log.*   -- id, 
FROM    dbo.fn_GenerateDateSteps(@st, @end, 30) AS d
        CROSS APPLY (
            SELECT  TOP 1 log.*
            FROM    lotsOfLogData AS log  -- contains data in 5 second intervals
            WHERE   -- search for dates in a certain range
                    utcTime between DATEADD(s, -10, d.dt) AND DATEADD(s, 5, d.dt)
                    AND log.stationid = 1000
            ORDER BY
                    -- get the 'closest'. this can change a little, but will always 
                    -- be based on a difference between the date
                    ABS(DATEDIFF(s, d.dt, UtcTime)) 
        ) log
like image 82
Lieven Keersmaekers Avatar answered Nov 10 '22 05:11

Lieven Keersmaekers