My challenge is to find pairs of rows that are adjacent by timestamp and keep only those pairs where the distance between their value fields (the positive value of the difference) is minimal.
A table measurement collects data from different sensors, each row with a timestamp and a value.
id | sensor_id | timestamp | value
---+-----------+-----------+------
1 | 1 | 12:00:00 | 5
2 | 2 | 12:01:00 | 6
3 | 1 | 12:02:00 | 4
4 | 2 | 12:02:00 | 7
5 | 2 | 12:03:00 | 3
6 | 1 | 12:05:00 | 3
7 | 2 | 12:06:00 | 4
8 | 2 | 12:07:00 | 5
9 | 1 | 12:08:00 | 6
A sensor's value is valid from its timestamp until the timestamp of its next record (same sensor_id).
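The validity rule can be sketched outside SQL as a "latest reading at or before t" lookup. This is only an illustration in Python. The function name `value_at` and the encoding of timestamps as minutes after 12:00 are assumptions made for the sketch, not part of the schema.

```python
from bisect import bisect_right

# Sample rows of sensor 1 from the table above: (minutes after 12:00, value).
SENSOR_1 = [(0, 5), (2, 4), (5, 3), (8, 6)]

def value_at(readings, minute):
    """Return the value valid at `minute`, or None before the first reading.

    `readings` must be sorted by time; a reading stays valid until the
    next reading of the same sensor.
    """
    times = [t for t, _ in readings]
    pos = bisect_right(times, minute)  # number of readings with time <= minute
    return readings[pos - 1][1] if pos else None

print(value_at(SENSOR_1, 4))  # the 12:02 reading is still valid at 12:04
```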
The lower green line shows the distance of sensor 1's (blue line) and sensor 2's (red line) values over time.
My aim is
The real table resides in a PostgreSQL database and contains about 5 million records from 15 sensors.
create table measurement (
id serial,
sensor_id integer,
timestamp timestamp,
value integer)
;
insert into measurement (sensor_id, timestamp, value)
values
(1, '2020-08-16 12:00:00', 5),
(2, '2020-08-16 12:01:00', 6),
(1, '2020-08-16 12:02:00', 4),
(2, '2020-08-16 12:02:00', 7),
(2, '2020-08-16 12:03:00', 3),
(1, '2020-08-16 12:05:00', 3),
(2, '2020-08-16 12:06:00', 4),
(2, '2020-08-16 12:07:00', 5),
(1, '2020-08-16 12:08:00', 6)
;
My approach was to pick 2 arbitrary sensors (by certain sensor_ids), make a self join, and retain for each record of sensor 1 only the record of sensor 2 with the previous timestamp (the biggest timestamp of sensor 2 with sensor 2's timestamp <= sensor 1's timestamp), and vice versa.
select
*
from (
select
*,
row_number() over (partition by m1.timestamp order by m2.timestamp desc) rownum
from measurement m1
join measurement m2
on m1.sensor_id <> m2.sensor_id
and m1.timestamp >= m2.timestamp
--arbitrarily sensor_ids 1 and 2
where m1.sensor_id = 1
and m2.sensor_id = 2
) foo
where rownum = 1
union --vice versa
select
*
from (
select
*,
row_number() over (partition by m2.timestamp order by m1.timestamp desc) rownum
from measurement m1
join measurement m2
on m1.sensor_id <> m2.sensor_id
and m1.timestamp <= m2.timestamp
--arbitrarily sensor_ids 1 and 2
where m1.sensor_id = 1
and m2.sensor_id = 2
) foo
where rownum = 1
;
But that returns a pair with 12:00:00, where sensor 2 has no data (not a big problem), and on the real table the statement execution does not end after hours (big problem).
I found several similar questions, but they don't match my problem.
Thanks in advance!
The first step is to calculate the difference at each timestamp. One method uses a lateral join and conditional aggregation:
select t.timestamp,
max(m.value) filter (where s.sensor_id = 1) as value_1,
max(m.value) filter (where s.sensor_id = 2) as value_2,
abs(max(m.value) filter (where s.sensor_id = 2) -
max(m.value) filter (where s.sensor_id = 1)
) as diff
from (values (1), (2)) s(sensor_id) cross join
(select distinct timestamp
from measurement
where sensor_id in (1, 2)
) t left join lateral
(select m.value
from measurement m
where m.sensor_id = s.sensor_id and
m.timestamp <= t.timestamp
order by m.timestamp desc
limit 1
) m
on 1=1
group by timestamp;
Now the question is when the difference enters a local minimum. For your sample data, the local minima are all one time unit long. That means that you can use lag() and lead() to find them:
with t as (
select t.timestamp,
max(m.value) filter (where s.sensor_id = 1) as value_1,
max(m.value) filter (where s.sensor_id = 2) as value_2,
abs(max(m.value) filter (where s.sensor_id = 2) -
max(m.value) filter (where s.sensor_id = 1)
) as diff
from (values (1), (2)) s(sensor_id) cross join
(select distinct timestamp
from measurement
where sensor_id in (1, 2)
) t left join lateral
(select m.value
from measurement m
where m.sensor_id = s.sensor_id and
m.timestamp <= t.timestamp
order by m.timestamp desc
limit 1
) m
on 1=1
group by timestamp
)
select *
from (select t.*,
lag(diff) over (order by timestamp) as prev_diff,
lead(diff) over (order by timestamp) as next_diff
from t
) t
where (diff < prev_diff or prev_diff is null) and
(diff < next_diff or next_diff is null);
That might not be a reasonable assumption to make. So, filter out adjacent duplicate values before applying this logic:
select *
from (select t.*,
lag(diff) over (order by timestamp) as prev_diff,
lead(diff) over (order by timestamp) as next_diff
from (select t.*, lag(diff) over (order by timestamp) as test_for_dup
from t
) t
where test_for_dup is distinct from diff
) t
where (diff < prev_diff or prev_diff is null) and
(diff < next_diff or next_diff is null)
Here is a db<>fiddle.
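To make the local-minimum logic explicit, here is a minimal Python sketch of the same two steps: collapse runs of equal adjacent differences, then keep rows that are smaller than both neighbours. The function name `local_minima` and the short timestamp strings are assumptions for illustration, and the sketch assumes there are no null differences in the input.

```python
from itertools import groupby

def local_minima(series):
    """Return (timestamp, diff) pairs where diff is a strict local minimum.

    Runs of equal adjacent diffs are first collapsed to their first row,
    mirroring the `is distinct from` filter in the query above.
    """
    # Keep the first row of every run of equal adjacent diffs.
    collapsed = [next(run) for _, run in groupby(series, key=lambda row: row[1])]
    minima = []
    for i, (ts, diff) in enumerate(collapsed):
        prev_diff = collapsed[i - 1][1] if i > 0 else None
        next_diff = collapsed[i + 1][1] if i + 1 < len(collapsed) else None
        # A missing neighbour counts as "larger", like the `is null` checks.
        if (prev_diff is None or diff < prev_diff) and \
           (next_diff is None or diff < next_diff):
            minima.append((ts, diff))
    return minima

# Distances between sensors 1 and 2 per timestamp in the sample data.
SAMPLE = [("12:01", 1), ("12:02", 3), ("12:03", 1), ("12:05", 0),
          ("12:06", 1), ("12:07", 2), ("12:08", 1)]
print(local_minima(SAMPLE))
```

On the sample data this keeps 12:01, 12:05 and 12:08. The rows at the start and end of the series qualify only because they have no neighbour on one side.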
You can use a couple of lateral joins. For example:
with
t as (select distinct timestamp as ts from measurement)
select
t.ts, s1.value as v1, s2.value as v2,
abs(s1.value - s2.value) as distance
from t,
lateral (
select value
from measurement m
where m.sensor_id = 1 and m.timestamp <= t.ts
order by timestamp desc
limit 1
) s1,
lateral (
select value
from measurement m
where m.sensor_id = 2 and m.timestamp <= t.ts
order by timestamp desc
limit 1
) s2
order by t.ts
Result:
ts v1 v2 distance
--------------------- -- -- --------
2020-08-16 12:01:00.0 5 6 1
2020-08-16 12:02:00.0 4 7 3
2020-08-16 12:03:00.0 4 3 1
2020-08-16 12:05:00.0 3 3 0
2020-08-16 12:06:00.0 3 4 1
2020-08-16 12:07:00.0 3 5 2
2020-08-16 12:08:00.0 6 5 1
See running example at DB Fiddle.
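For comparison, the same result table can be produced in a single pass over the rows sorted by timestamp, remembering the latest value of each sensor. This is a hedged Python sketch of the idea, not a claim about how PostgreSQL executes the lateral joins. The function name `distances` and the minutes-after-12:00 encoding are assumptions.

```python
from itertools import groupby

def distances(rows, a=1, b=2):
    """Return (minute, value_a, value_b, distance) per distinct timestamp.

    `rows` are (sensor_id, minute, value) tuples. Timestamps where either
    sensor has not reported yet are skipped, like the inner lateral joins.
    """
    latest = {}  # sensor_id -> most recent value seen so far
    out = []
    by_minute = sorted(rows, key=lambda r: r[1])
    for minute, group in groupby(by_minute, key=lambda r: r[1]):
        for sensor, _, value in group:
            latest[sensor] = value  # the newest reading replaces the old one
        if a in latest and b in latest:
            out.append((minute, latest[a], latest[b], abs(latest[a] - latest[b])))
    return out

# Sample data: (sensor_id, minutes after 12:00, value).
ROWS = [(1, 0, 5), (2, 1, 6), (1, 2, 4), (2, 2, 7), (2, 3, 3),
        (1, 5, 3), (2, 6, 4), (2, 7, 5), (1, 8, 6)]
for row in distances(ROWS):
    print(row)
```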
Also, if you want all timestamps, even unmatched ones like 12:00:00, you can do:
with
t as (select distinct timestamp as ts from measurement)
select
t.ts, s1.value as v1, s2.value as v2,
abs(s1.value - s2.value) as distance
from t
left join lateral (
select value
from measurement m
where m.sensor_id = 1 and m.timestamp <= t.ts
order by timestamp desc
limit 1
) s1 on true
left join lateral (
select value
from measurement m
where m.sensor_id = 2 and m.timestamp <= t.ts
order by timestamp desc
limit 1
) s2 on true
order by t.ts
In those cases it's not possible to compute the distance, though.
Result:
ts v1 v2 distance
--------------------- -- ------ --------
2020-08-16 12:00:00.0 5 <null> <null>
2020-08-16 12:01:00.0 5 6 1
2020-08-16 12:02:00.0 4 7 3
2020-08-16 12:03:00.0 4 3 1
2020-08-16 12:05:00.0 3 3 0
2020-08-16 12:06:00.0 3 4 1
2020-08-16 12:07:00.0 3 5 2
2020-08-16 12:08:00.0 6 5 1
The infill of missing values requires window functions and a Cartesian product of every minute crossed with your two sensors.
The invars CTE accepts the parameters.
with invars as (
select '2020-08-16 12:00:00'::timestamp as start_ts,
'2020-08-16 12:08:00'::timestamp as end_ts,
array[1, 2] as sensor_ids
),
Create the matrix of minute x sensor_id:
calendar as (
select g.minute, s.sensor_id,
sensor_ids[1] as sid1,
sensor_ids[2] as sid2
from invars i
cross join generate_series(
i.start_ts, i.end_ts, interval '1 minute'
) as g(minute)
cross join unnest(i.sensor_ids) as s(sensor_id)
),
Find mgrp, a running group number that increments every time a new value is available from a sensor_id:
gaps as (
select c.minute, c.sensor_id, m.value,
sum(case when m.value is null then 0 else 1 end)
over (partition by c.sensor_id
order by c.minute) as mgrp,
c.sid1, c.sid2
from calendar c
left join measurement m
on m.timestamp = c.minute
and m.sensor_id = c.sensor_id
),
Interpolate missing values by carrying forward the most recent value
interpolated as (
select minute,
sensor_id,
coalesce(
value, first_value(value) over
(partition by sensor_id, mgrp
order by minute)
) as value, sid1, sid2
from gaps
)
Perform the distance calculation (sum() could have been max() or min(); it makes no difference as long as each sensor contributes a single row per minute):
select minute,
sum(value) filter (where sensor_id = sid1) as value1,
sum(value) filter (where sensor_id = sid2) as value2,
abs(
sum(value) filter (where sensor_id = sid1)
- sum(value) filter (where sensor_id = sid2)
) as distance
from interpolated
group by minute
order by minute;
Results:
| minute | value1 | value2 | distance |
| ------------------------ | ------ | ------ | -------- |
| 2020-08-16T12:00:00.000Z | 5 | | |
| 2020-08-16T12:01:00.000Z | 5 | 6 | 1 |
| 2020-08-16T12:02:00.000Z | 4 | 7 | 3 |
| 2020-08-16T12:03:00.000Z | 4 | 3 | 1 |
| 2020-08-16T12:04:00.000Z | 4 | 3 | 1 |
| 2020-08-16T12:05:00.000Z | 3 | 3 | 0 |
| 2020-08-16T12:06:00.000Z | 3 | 4 | 1 |
| 2020-08-16T12:07:00.000Z | 3 | 5 | 2 |
| 2020-08-16T12:08:00.000Z | 6 | 5 | 1 |
---
[View on DB Fiddle](https://www.db-fiddle.com/f/p65hiAFVT4v3TrjTPbrZnC/0)
Please see this working fiddle.
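The calendar-plus-carry-forward idea can also be sketched in a few lines of Python, which may help when checking the CTE chain step by step. The function name `forward_fill` and the encoding of timestamps as minutes after 12:00 are assumptions for this illustration.

```python
def forward_fill(readings, start, end):
    """Carry the most recent value forward over every minute in [start, end].

    `readings` maps minute -> value for one sensor. Minutes before the
    first reading stay None, as in the query results above.
    """
    filled, last = {}, None
    for minute in range(start, end + 1):
        last = readings.get(minute, last)  # a new reading replaces the carried value
        filled[minute] = last
    return filled

# Sample data as minutes after 12:00.
s1 = forward_fill({0: 5, 2: 4, 5: 3, 8: 6}, 0, 8)
s2 = forward_fill({1: 6, 2: 7, 3: 3, 6: 4, 7: 5}, 0, 8)

# The distance is undefined while either sensor has no value yet.
distance = {
    m: None if s1[m] is None or s2[m] is None else abs(s1[m] - s2[m])
    for m in s1
}
print(distance)
```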
Use window functions and check the neighbors. (You'll need an extra anti-self-join to remove the duplicates, and you'll have to invent a tie-breaker for the stable marriage problem.) Note that the timestamp column is named ztimestamp in this query.
SELECT id,sensor_id, ztimestamp,value
-- , prev_ts, next_ts
, (ztimestamp - prev_ts) AS prev_span
, (next_ts - ztimestamp) AS next_span
, (sensor_id <> prev_sensor) AS prev_valid
, (sensor_id <> next_sensor) AS next_valid
, CASE WHEN (sensor_id <> prev_sensor AND sensor_id <> next_sensor) THEN
CASE WHEN (ztimestamp - prev_ts) < (next_ts - ztimestamp) THEN prev_id ELSE next_id END
WHEN (sensor_id <> prev_sensor) THEN prev_id
WHEN (sensor_id <> next_sensor) THEN next_id
ELSE NULL END AS best_neigbor
FROM (
SELECT id,sensor_id, ztimestamp,value
, lag(id) OVER www AS prev_id
, lead(id) OVER www AS next_id
, lag(sensor_id) OVER www AS prev_sensor
, lead(sensor_id) OVER www AS next_sensor
, lag(ztimestamp) OVER www AS prev_ts
, lead(ztimestamp) OVER www AS next_ts
FROM measurement
WINDOW www AS (order by ztimestamp)
) q
ORDER BY ztimestamp,sensor_id
;
Result:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 9
id | sensor_id | ztimestamp | value | prev_span | next_span | prev_valid | next_valid | best_neigbor
----+-----------+---------------------+-------+-----------+-----------+------------+------------+--------------
1 | 1 | 2020-08-16 12:00:00 | 5 | | 00:01:00 | | t | 2
2 | 2 | 2020-08-16 12:01:00 | 6 | 00:01:00 | 00:01:00 | t | t | 3
3 | 1 | 2020-08-16 12:02:00 | 4 | 00:01:00 | 00:00:00 | t | t | 4
4 | 2 | 2020-08-16 12:02:00 | 7 | 00:00:00 | 00:01:00 | t | f | 3
5 | 2 | 2020-08-16 12:03:00 | 3 | 00:01:00 | 00:02:00 | f | t | 6
6 | 1 | 2020-08-16 12:05:00 | 3 | 00:02:00 | 00:01:00 | t | t | 7
7 | 2 | 2020-08-16 12:06:00 | 4 | 00:01:00 | 00:01:00 | t | f | 6
8 | 2 | 2020-08-16 12:07:00 | 5 | 00:01:00 | 00:01:00 | f | t | 9
9 | 1 | 2020-08-16 12:08:00 | 6 | 00:01:00 | | t | | 8
(9 rows)
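The CASE expression can be restated as a small Python sketch to show how each row picks its neighbor. The function name `best_neighbors` is an assumption, timestamps are encoded as minutes after 12:00, and the id is used to break timestamp ties when sorting (the query itself orders by ztimestamp only, so its order for the two 12:02 rows is not guaranteed).

```python
def best_neighbors(rows):
    """Map each row id to the id of its nearest adjacent row from another sensor.

    `rows` are (id, sensor_id, minute) tuples. Only the immediate
    predecessor and successor in time order are considered; on equal
    spans the successor wins, as in the CASE expression above.
    """
    ordered = sorted(rows, key=lambda r: (r[2], r[0]))  # id breaks ties (assumption)
    result = {}
    for i, (row_id, sensor, minute) in enumerate(ordered):
        prev = ordered[i - 1] if i > 0 else None
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        prev_ok = prev is not None and prev[1] != sensor
        next_ok = nxt is not None and nxt[1] != sensor
        if prev_ok and next_ok:
            closer_prev = minute - prev[2] < nxt[2] - minute
            result[row_id] = prev[0] if closer_prev else nxt[0]
        elif prev_ok:
            result[row_id] = prev[0]
        elif next_ok:
            result[row_id] = nxt[0]
        else:
            result[row_id] = None
    return result

# Sample data: (id, sensor_id, minutes after 12:00).
ROWS = [(1, 1, 0), (2, 2, 1), (3, 1, 2), (4, 2, 2), (5, 2, 3),
        (6, 1, 5), (7, 2, 6), (8, 2, 7), (9, 1, 8)]
print(best_neighbors(ROWS))
```

Because only adjacent rows are inspected, a row whose two neighbors both belong to the same sensor gets no match, even if a row from another sensor exists further away.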