
Select rows with closest timestamp

Tags: sql, mysql

I have a table that looks something like the following - essentially containing a timestamp as well as some other columns:

WeatherTable
+---------------------+---------+----------------+      +
| TS                  | MonthET | InsideHumidity | .... |
+---------------------+---------+----------------+      |
| 2014-10-27 14:24:22 |       0 |             54 |      |
| 2014-10-27 14:24:24 |       0 |             54 |      |
| 2014-10-27 14:24:26 |       0 |             52 |      |
| 2014-10-27 14:24:28 |       0 |             54 |      |
| 2014-10-27 14:24:30 |       0 |             53 |      |
| 2014-10-27 14:24:32 |       0 |             55 |      |
| 2014-10-27 14:24:34 |       9 |             54 |      |
.......

I'm trying to formulate a SQL query that returns all rows within a certain timeframe (no problem here) with a certain arbitrary granularity, for instance, every 15 seconds. The number is always specified in seconds but is not limited to values less than 60. To complicate things further, the timestamps don't necessarily fall on the granularity required, so it's not a case of simply selecting the timestamp of 14:24:00, 14:24:15, 14:24:30, etc. - the row with the closest timestamp to each value needs to be included in the result.

For example, if the starting time was given as 14:24:30, the end time as 14:32:00, and the granularity was 130, the ideal times would be:

14:24:30
14:26:40
14:28:50
14:31:00

However, timestamps may not exist for each of those times, in which case the row with the closest timestamp to each of those ideal timestamps should be selected. In the case of two timestamps which are equally far away from the ideal timestamp, the earlier one should be selected.
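
To make the selection rule concrete, here is a minimal Python sketch of the intended behaviour, outside of SQL. The function names `ideal_times` and `closest_timestamps` are hypothetical, and the row timestamps in the usage lines are made up for illustration; only the start time, end time and granularity come from the example above.

```python
from datetime import datetime, timedelta


def ideal_times(start, end, step_seconds):
    """Return start, start + step, start + 2 * step, ... not exceeding end."""
    step = timedelta(seconds=step_seconds)
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += step
    return times


def closest_timestamps(timestamps, targets):
    """For each target, pick the nearest timestamp; the earlier one wins a tie."""
    # Sorting key is (distance, timestamp), so equal distances fall back
    # to the earlier timestamp.
    return [min(timestamps, key=lambda ts: (abs(ts - t), ts)) for t in targets]


start = datetime(2014, 10, 27, 14, 24, 30)
end = datetime(2014, 10, 27, 14, 32, 0)
targets = ideal_times(start, end, 130)
print([t.strftime("%H:%M:%S") for t in targets])
# -> ['14:24:30', '14:26:40', '14:28:50', '14:31:00']

# Hypothetical rows: 14:24:28 and 14:24:32 are equally far from 14:24:30,
# so the earlier one (14:24:28) is chosen.
rows = [datetime(2014, 10, 27, 14, 24, 28), datetime(2014, 10, 27, 14, 24, 32)]
print(closest_timestamps(rows, targets[:1]))
```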

The database is part of a web service, so presently I'm just ignoring the granularity in the SQL query and filtering the unwanted results out in (Java) code later. However, this seems far from ideal in terms of memory consumption and performance.

Any ideas?

Michael Berry asked Apr 28 '26 09:04


2 Answers

You could try to do it like this:

First, create a list of time intervals. Using the stored procedure make_intervals from "Get a list of dates between two dates", create a temporary table by calling it like this:

call make_intervals(@startdate,@enddate,15,'SECOND');

You will then have a table time_intervals with two columns, one of which is named interval_start. Use it to find the timestamp closest to each interval, like this:

CREATE TEMPORARY TABLE IF NOT EXISTS time_intervals_copy
  AS (SELECT * FROM time_intervals);

SELECT
  time_intervals.interval_start,
  WeatherTable.*
FROM time_intervals
JOIN WeatherTable
  ON WeatherTable.TS BETWEEN @startdate AND @enddate
JOIN (SELECT
        time_intervals.interval_start AS interval_start,
        -- TIMESTAMPDIFF gives the gap in seconds; subtracting DATETIME values
        -- directly compares YYYYMMDDhhmmss numbers, which is not a duration
        MIN(ABS(TIMESTAMPDIFF(SECOND, time_intervals.interval_start, WeatherTable.TS))) AS ts_diff
      FROM time_intervals_copy AS time_intervals
      JOIN WeatherTable
      WHERE WeatherTable.TS BETWEEN @startdate AND @enddate
      GROUP BY time_intervals.interval_start) AS closest
  ON closest.interval_start = time_intervals.interval_start AND
     ABS(TIMESTAMPDIFF(SECOND, time_intervals.interval_start, WeatherTable.TS)) = closest.ts_diff
-- One row per interval; selecting WeatherTable.* with this GROUP BY
-- requires ONLY_FULL_GROUP_BY to be disabled
GROUP BY time_intervals.interval_start;

This will find the closest timestamp to every interval. Note: a row in WeatherTable can be listed more than once if the interval used is small compared with the spacing of the stored data (roughly, less than half of it), because the same row can then be the closest match for several intervals.

Note: I have not tested these queries; they are written from memory. Please adjust them to your use case and correct any minor mistakes they may contain.

wolfgangwalther answered Apr 30 '26 23:04


For testing purposes, I extended your dataset to the following timestamps. The column in my database is called time_stamp.

2014-10-27 14:24:24
2014-10-27 14:24:26
2014-10-27 14:24:28
2014-10-27 14:24:32
2014-10-27 14:24:34
2014-10-27 14:24:25
2014-10-27 14:24:32
2014-10-27 14:24:34
2014-10-27 14:24:36
2014-10-27 14:24:37
2014-10-27 14:24:39
2014-10-27 14:24:44
2014-10-27 14:24:47
2014-10-27 14:24:53

I've summarized the idea, but let me explain in more detail before providing the solution I was able to work out.

The requirement is to match timestamps on either side of each ideal time. Since a match can lie in either direction, we split the interval in half: the range from half an interval before an ideal time to half an interval after it defines a "bin" to consider.

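The bin for a given time, measured from a given start time with an interval of @seconds, is then given by this MySQL expression (TIMESTAMPDIFF yields the offset in seconds, which subtracting DATETIME values directly does not):

((floor((TIMESTAMPDIFF(SECOND, @time_start, t1.time_stamp) - (@seconds/2))/@seconds) + 1) * @seconds)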

NOTE: The + 1 is there so that we do not end up with a bin index of -1 (bins start at zero). All times are calculated from the start time to ensure that intervals of 60 seconds or more work.

Within each bin, we need to know how far each timestamp is from the center of the bin. That is done by taking the number of seconds from the start time, subtracting the bin value from it, and taking the absolute value.
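
As a worked illustration of the arithmetic, here is a minimal Python sketch of the same bin expression applied to plain second offsets from the start time. The function name `bin_and_distance` is hypothetical; the interval of 7 seconds matches the value used in the query further down.

```python
from math import floor


def bin_and_distance(offset, seconds):
    """Return (bin center, distance to that center) for an offset in seconds.

    Mirrors (floor((offset - seconds/2) / seconds) + 1) * seconds.
    """
    center = (floor((offset - seconds / 2) / seconds) + 1) * seconds
    return center, abs(offset - center)


# With a 7-second interval, offsets 0..3 fall into bin 0 and 4..10 into bin 7.
for offset in (0, 3, 4, 8, 10, 13):
    print(offset, bin_and_distance(offset, 7))
```

An offset of 4 lands in bin 7 at a distance of 3, while an offset of 8 lands in the same bin at a distance of 1, so the latter is the better match for that bin.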

At this stage we then have all times "binned up" and ordered within the bin.

To filter these results, we LEFT JOIN the table to itself and set up the join conditions so that every undesirable row finds a match. After the LEFT JOIN, the desirable rows are the ones with NULL in the columns of the joined copy.
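
The filtering idea can be sketched in Python as follows: a row is kept only if no other row in the same bin is closer to the bin center, or equally close but earlier. This is an illustrative sketch rather than a translation of any tested code; the names `bin_center` and `pick_closest` are hypothetical, and it assumes that only rows inside the requested range take part in the comparison.

```python
from datetime import datetime
from math import floor


def bin_center(offset, seconds):
    """Bin center (in seconds from the start) for a given offset."""
    return (floor((offset - seconds / 2) / seconds) + 1) * seconds


def pick_closest(timestamps, start, end, seconds):
    """Keep the rows that no other row in the same bin beats."""
    in_range = [ts for ts in timestamps if start <= ts <= end]

    def key(ts):
        offset = int((ts - start).total_seconds())
        center = bin_center(offset, seconds)
        return center, abs(offset - center), offset

    kept = []
    for a in in_range:
        a_bin, a_dist, a_off = key(a)
        # "beaten" plays the role of a successful match in the LEFT JOIN
        beaten = any(
            b_bin == a_bin
            and (b_dist < a_dist or (b_dist == a_dist and b_off < a_off))
            for b_bin, b_dist, b_off in map(key, in_range)
        )
        if not beaten:
            kept.append(a)
    return sorted(kept)


# The test timestamps listed earlier in this answer (seconds within 14:24).
secs = [24, 26, 28, 32, 34, 25, 32, 34, 36, 37, 39, 44, 47, 53]
rows = [datetime(2014, 10, 27, 14, 24, s) for s in secs]
result = pick_closest(
    rows,
    datetime(2014, 10, 27, 14, 24, 24),
    datetime(2014, 10, 27, 14, 24, 52),
    7,
)
print([ts.strftime("%H:%M:%S") for ts in result])
# -> ['14:24:24', '14:24:32', '14:24:32', '14:24:37', '14:24:44']
```

Duplicate timestamps do not beat each other, so both copies of 14:24:32 survive the filter.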

I have, somewhat hackishly, replaced the start time, end time and interval with variables, but only for readability. MySQL-style comments in the LEFT JOIN ON clause identify the conditions.

SET @seconds = 7;
SET @time_start = TIMESTAMP('2014-10-27 14:24:24');
SET @time_end = TIMESTAMP('2014-10-27 14:24:52');

# TIMESTAMPDIFF(SECOND, a, b) returns b - a in seconds. Subtracting DATETIME
# values directly compares YYYYMMDDhhmmss numbers and breaks across minutes.
SELECT t1.*
FROM temp t1
LEFT JOIN temp t2 ON
  #Condition 1: Only considering rows in the same "bin"
  ((floor((TIMESTAMPDIFF(SECOND, @time_start, t1.time_stamp) - (@seconds/2))/@seconds) + 1) * @seconds)
 = ((floor((TIMESTAMPDIFF(SECOND, @time_start, t2.time_stamp) - (@seconds/2))/@seconds) + 1) * @seconds)
  #Only rows inside the requested range may displace another row
AND t2.time_stamp BETWEEN @time_start AND @time_end
AND
(
  #Condition 2 (Part A): "Filter" by removing rows which are farther from the center of the bin than others
  abs(
      TIMESTAMPDIFF(SECOND, @time_start, t1.time_stamp)
      - (floor((TIMESTAMPDIFF(SECOND, @time_start, t1.time_stamp) - (@seconds/2))/@seconds) + 1) * @seconds
  )
  >
  abs(
      TIMESTAMPDIFF(SECOND, @time_start, t2.time_stamp)
      - (floor((TIMESTAMPDIFF(SECOND, @time_start, t2.time_stamp) - (@seconds/2))/@seconds) + 1) * @seconds
  )
  OR
  #Condition 2 (Part B1): "Filter" by removing rows which are the same distance from the center of the bin
  (
    abs(
        TIMESTAMPDIFF(SECOND, @time_start, t1.time_stamp)
        - (floor((TIMESTAMPDIFF(SECOND, @time_start, t1.time_stamp) - (@seconds/2))/@seconds) + 1) * @seconds
    )
    =
    abs(
        TIMESTAMPDIFF(SECOND, @time_start, t2.time_stamp)
        - (floor((TIMESTAMPDIFF(SECOND, @time_start, t2.time_stamp) - (@seconds/2))/@seconds) + 1) * @seconds
    )
    #Condition 2 (Part B2): And are later than the other match
    AND t1.time_stamp > t2.time_stamp
  )
)
WHERE t1.time_stamp BETWEEN @time_start AND @time_end
#Condition 3: All rows which have a match are undesirable, so those
#with a NULL for the primary key (in this case temp_id) are selected
AND t2.temp_id IS NULL;

There may be a more succinct way to write the query, but it did filter the results down to what was needed, with one notable exception: I purposely put in a duplicate entry. The query returns both copies of it, since both meet the criteria as stated.

