Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging two data sets on closest date efficiently in PostgreSQL

I try to merge two tables with different time resolution on their closest date.

The tables are like this:

Table1:

id    | date    | device  | value1
----------------------------------
1     | 10:22   | 13      | 0.53
2     | 10:24   | 13      | 0.67
3     | 10:25   | 14      | 0.83
4     | 10:25   | 13      | 0.32

Table2:

id    | date    | device  | value2
----------------------------------
22    | 10:18   | 13      | 0.77
23    | 10:21   | 14      | 0.53
24    | 10:23   | 13      | 0.67
25    | 10:28   | 14      | 0.83
26    | 10:31   | 13      | 0.23

I want to merge these tables along the first one. So I want to append value2 to Table1, where, for each device, the latest value2 appears.

Result:

id    | date    | device  | value1 | value2
-------------------------------------------
1     | 10:22   | 13      | 0.53   | 0.77
2     | 10:24   | 13      | 0.67   | 0.67
3     | 10:25   | 14      | 0.83   | 0.53
4     | 10:25   | 13      | 0.32   | 0.67

I have some (20-30) devices, thousands of rows in Table2 (=m) and millions of them in Table1 (=n).

I could sort all the tables along date (O(n*logn)), write them into a text file and iterate over Table1 like a merge, while pulling data from Table2 until it is newer (I have to manage that ~20-30 pointers to the latest data for each device, but no more), and after the merge I could upload it back to the database. Then the complexities are O(n*log(n)) for sorting and O(n+m) for iterating over the tables.

But it would be much better to do it in the database at all. But the best query I could achive was O(n^2) complexity:

SELECT DISTINCT ON (Table1.id)
       Table1.id, Table1.date, Table1.device, Table1.value1, Table2.value2
FROM Table1, Table2
WHERE Table1.date > Table2.date and Table1.device = Table2.device
ORDER BY Table1.id, Table1.date-Table2.date;

It's really slow for the data amount I need to process, are there better ways to do this? Or just do that stuff with the downloaded data?

like image 501
hunyadym Avatar asked Nov 08 '14 17:11

hunyadym


1 Answers

Your query can be rewritten as:

SELECT DISTINCT ON (t1.id)
       t1.id, t1.date, t1.device, t1.value1, t2.value2
FROM   table1 t1
JOIN   table2 t2 USING (device)
WHERE  t1.date > t2.date
ORDER  BY t1.id, t2.date DESC;

No need to calculate a date difference for every combination of rows (which is expensive and not sargable), just pick the row with the greatest t2.date from each set. Index support is advisable. Details for DISTINCT ON:

  • Select first row in each GROUP BY group?

That's probably not fast enough, yet. Given your data distribution you would need a loose index scan, which can be emulated with correlated subqueries (like Gordon's query) or a more modern and versatile JOIN LATERAL:

SELECT t1.id, t1.date, t1.device, t1.value1, t2.value2
FROM   table1 t1
LEFT   JOIN LATERAL (
   SELECT value2
   FROM   table2
   WHERE  device = t1.device
   AND    date   < t1.date
   ORDER  BY date DESC
   LIMIT  1
   ) t2 ON TRUE;

The LEFT JOIN avoids losing rows when no match is found in t2. Details:

  • Optimize GROUP BY query to retrieve latest row per user

But that's still not very fast, since you have "thousands of rows in Table2 and millions of them in Table1".

Two ideas, probably faster, but also more complex:

1. UNION ALL plus window functions

Combine Table1 and Table2 in a UNION ALL query and run a window function over the derived table. This is enhanced by the "moving aggregate support" in Postgres 9.4 or later.

SELECT id, date, device, value1, value2
FROM  (
   SELECT id, date, device, value1
        , min(value2) OVER (PARTITION BY device, grp) AS value2
   FROM  (
      SELECT *
           , count(value2) OVER (PARTITION BY device ORDER BY date) AS grp
      FROM  (
         SELECT id, date, device, value1, NULL::numeric AS value2 
         FROM   table1

         UNION  ALL
         SELECT id, date, device, NULL::numeric AS value1, value2
         FROM   table2
         ) s1
      ) s2
   ) s3
WHERE  value1 IS NOT NULL
ORDER  BY date, id;

You'll have to test if it can compete. Sufficient work_mem allows in-memory sorting.

db<>fiddle here for all three queries
Old sqlfiddle

2. PL/pgSQL function

Cursor for each device in Table2, loop over Table1, pick the value from respective device-cursor after advancing until cursor.date > t1.date and keeping value2 from the row before last. Similar to the winning implementation here:

  • Window Functions or Common Table Expressions: count previous rows within range

Probably fastest, but more code to write.

like image 113
Erwin Brandstetter Avatar answered Oct 21 '22 12:10

Erwin Brandstetter