I'm trying to merge two tables with different time resolutions on their closest dates.
The tables look like this:
Table1:
id | date  | device | value1
---+-------+--------+-------
 1 | 10:22 | 13     | 0.53
 2 | 10:24 | 13     | 0.67
 3 | 10:25 | 14     | 0.83
 4 | 10:25 | 13     | 0.32
Table2:
id | date  | device | value2
---+-------+--------+-------
22 | 10:18 | 13     | 0.77
23 | 10:21 | 14     | 0.53
24 | 10:23 | 13     | 0.67
25 | 10:28 | 14     | 0.83
26 | 10:31 | 13     | 0.23
I want to merge these tables based on the first one. That is, for each row of Table1, I want to append the value2 of the most recent earlier Table2 row for the same device.
Result:
id | date  | device | value1 | value2
---+-------+--------+--------+-------
 1 | 10:22 | 13     | 0.53   | 0.77
 2 | 10:24 | 13     | 0.67   | 0.67
 3 | 10:25 | 14     | 0.83   | 0.53
 4 | 10:25 | 13     | 0.32   | 0.67
I have a handful (20-30) of devices, thousands of rows in Table2 (= m), and millions of rows in Table1 (= n).
I could sort both tables by date (O(n log n)), write them to a text file, and iterate over Table1 merge-style while pulling rows from Table2 until they are newer (I have to manage ~20-30 pointers to the latest data for each device, but no more), then upload the merged result back to the database. The complexity would be O(n log n) for sorting plus O(n + m) for iterating over the tables.
But it would be much better to do all of it in the database. The best query I could achieve, though, has O(n^2) complexity:
SELECT DISTINCT ON (Table1.id)
       Table1.id, Table1.date, Table1.device, Table1.value1, Table2.value2
FROM   Table1, Table2
WHERE  Table1.date > Table2.date
AND    Table1.device = Table2.device
ORDER  BY Table1.id, Table1.date - Table2.date;
It's really slow for the amount of data I need to process. Are there better ways to do this, or should I just do the work outside the database on the downloaded data?
Your query can be rewritten as:
SELECT DISTINCT ON (t1.id)
       t1.id, t1.date, t1.device, t1.value1, t2.value2
FROM   table1 t1
JOIN   table2 t2 USING (device)
WHERE  t1.date > t2.date
ORDER  BY t1.id, t2.date DESC;
No need to calculate a date difference for every combination of rows (which is expensive and not sargable); just pick the row with the greatest t2.date from each set. Index support is advisable, as sketched below.
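A minimal sketch of a supporting index, assuming the table definitions above (the index name is my own invention): a multicolumn index on (device, date) lets Postgres jump straight to the latest earlier row per device instead of scanning all of table2, and appending value2 as a trailing column can allow index-only scans:

-- Hypothetical supporting index; value2 is appended as a trailing
-- column so the lookup can be answered from the index alone.
CREATE INDEX table2_device_date_idx ON table2 (device, date, value2);

The same index serves both the DISTINCT ON query above and the LATERAL query below.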
That's probably still not fast enough. Given your data distribution, you would need a loose index scan, which can be emulated with correlated subqueries (like Gordon's query) or with the more modern and versatile JOIN LATERAL:
SELECT t1.id, t1.date, t1.device, t1.value1, t2.value2
FROM   table1 t1
LEFT   JOIN LATERAL (
   SELECT value2
   FROM   table2
   WHERE  device = t1.device
   AND    date < t1.date
   ORDER  BY date DESC
   LIMIT  1
   ) t2 ON TRUE;
The LEFT JOIN avoids losing rows from table1 when no match is found in t2.
But that's still not very fast, since you have "thousands of rows in Table2 and millions of them in Table1".
Two ideas, probably faster, but also more complex:
1. UNION ALL plus window functions
Combine Table1 and Table2 in a UNION ALL query and run a window function over the derived table. This is enhanced by the "moving aggregate support" in Postgres 9.4 or later.
SELECT id, date, device, value1, value2
FROM  (
   SELECT id, date, device, value1
        , min(value2) OVER (PARTITION BY device, grp) AS value2
   FROM  (
      SELECT *
           , count(value2) OVER (PARTITION BY device ORDER BY date) AS grp
      FROM  (
         SELECT id, date, device, value1, NULL::numeric AS value2
         FROM   table1
         UNION  ALL
         SELECT id, date, device, NULL::numeric AS value1, value2
         FROM   table2
         ) s1
      ) s2
   ) s3
WHERE  value1 IS NOT NULL
ORDER  BY date, id;
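To see how the grouping works, here are the device-13 rows from the sample data, ordered by date. count(value2) only increments at non-null value2, so every Table1 row inherits the group number of the latest preceding Table2 row, and min(value2) per (device, grp) recovers that value2 (it is the only non-null value in the group):

date  | source | value2 | grp | min(value2) per grp
------+--------+--------+-----+--------------------
10:18 | Table2 | 0.77   | 1   | 0.77
10:22 | Table1 | NULL   | 1   | 0.77
10:23 | Table2 | 0.67   | 2   | 0.67
10:24 | Table1 | NULL   | 2   | 0.67
10:25 | Table1 | NULL   | 2   | 0.67
10:31 | Table2 | 0.23   | 3   | 0.23

One caveat: rows with identical dates are peers in the default window frame, so a Table2 row dated exactly the same as a Table1 row would be counted here, unlike the strict date < comparison in the queries above.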
You'll have to test whether it can compete. A sufficiently large work_mem setting allows the sorts to happen in memory.
2. Cursor for each device
Open a cursor for each device in Table2, loop over Table1, and pick the value from the respective device's cursor after advancing it until cursor.date > t1.date, keeping the value2 from the row before last. Probably the fastest option, but more code to write; a sketch follows below.
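Here is a rough, untested sketch of that idea in PL/pgSQL, assuming the column types suggested by the question (id int, date time, device int, values numeric); the function name f_merge_closest is made up. Instead of juggling 20-30 separate cursors, it sorts both tables by (device, date) and merges them in lockstep with a single cursor over table2, which is the same merge described in the question:

CREATE OR REPLACE FUNCTION f_merge_closest()
  RETURNS TABLE (id int, date time, device int, value1 numeric, value2 numeric)
  LANGUAGE plpgsql AS
$func$
DECLARE
   t2       record;
   done     bool := false;
   last_dev int;              -- device of the most recently consumed table2 row
   last_val numeric;          -- its value2
   cur2 CURSOR FOR
      SELECT * FROM table2 t ORDER BY t.device, t.date;
BEGIN
   OPEN cur2;
   FETCH cur2 INTO t2;
   done := NOT FOUND;

   FOR id, date, device, value1 IN
      SELECT t.id, t.date, t.device, t.value1
      FROM   table1 t
      ORDER  BY t.device, t.date
   LOOP
      -- advance the table2 cursor while it is still strictly
      -- before the current table1 row in (device, date) order
      WHILE NOT done
        AND (t2.device < device OR
            (t2.device = device AND t2.date < date))
      LOOP
         last_dev := t2.device;
         last_val := t2.value2;
         FETCH cur2 INTO t2;
         done := NOT FOUND;
      END LOOP;

      -- only use the remembered value2 if it belongs to the same device
      value2 := CASE WHEN last_dev = device THEN last_val END;
      RETURN NEXT;
   END LOOP;

   CLOSE cur2;
END
$func$;

Call it with SELECT * FROM f_merge_closest(); note that rows come back ordered by (device, date), not by id.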