Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LEFT JOIN ON most recent date in Google BigQuery

I've got two tables, both with timestamps and some more data:

Table A

| name | timestamp           | a_data |
| ---- | ------------------- | ------ |
| 1    | 2018-01-01 11:10:00 | a      |
| 2    | 2018-01-01 12:20:00 | b      |
| 3    | 2018-01-01 13:30:00 | c      |

Table B

| name | timestamp           | b_data |
| ---- | ------------------- | ------ |
| 1    | 2018-01-01 11:00:00 | w      |
| 2    | 2018-01-01 12:00:00 | x      |
| 3    | 2018-01-01 13:00:00 | y      |
| 3    | 2018-01-01 13:10:00 | y      |
| 3    | 2018-01-01 13:10:00 | z      |

What I want to do is

  1. For each row in Table A LEFT JOIN the most recent record in Table B that predates it.
  2. When there is more than one possibility take the last one

Target Result

| name | timestamp           | a_data | b_data |
| ---- | ------------------- | ------ | ------ |
| 1    | 2018-01-01 11:10:00 | a      | w      |
| 2    | 2018-01-01 12:20:00 | b      | x      |
| 3    | 2018-01-01 13:30:00 | c      | z      | <-- note z, not y

I think this involves a subquery, but I cannot get this to work in Big Query. What I have so far:

SELECT a.a_data, b.b_data
FROM `table_a` AS  a  

LEFT JOIN `table_b` AS b 
ON a.name = b.name

WHERE a.timestamp = (
  SELECT max(timestamp) from `table_b` as sub
  WHERE sub.name = b.name
  AND sub.timestamp < a.timestamp
)

On my actual dataset, which is a very small test set (under 2Mb) the query runs but never completes. Any pointers much appreciated 👍🏻

like image 961
wab Avatar asked Oct 16 '25 17:10

wab


2 Answers

You can try to use a select subquery.

SELECT a.*,(
    SELECT MAX(b.b_data) 
    FROM `table_b` AS b 
    WHERE 
        a.name = b.name 
    and  
        b.timestamp < a.timestamp
) b_data 
FROM `table_a` AS  a

EDIT

Or you can try to use ROW_NUMBER window function in a subquery.

SELECT name,timestamp,a_data , b_data
FROM (
    SELECT a.*,b.b_data,ROW_NUMBER() OVER(PARTITION BY a.name ORDER BY b.timestamp desc,b.name desc) rn 
    FROM `table_a` AS  a  
    LEFT JOIN `table_b` AS b ON a.name = b.name AND b.timestamp < a.timestamp
) t1
WHERE rn = 1
like image 78
D-Shih Avatar answered Oct 19 '25 11:10

D-Shih


Below is for BigQuery Standard SQL and does not require specifying all columns on both sides - only name and timestamp. So it will work for any number of the columns in both tables (assuming no ambiguity in name rather than for above mentioned two columns)

#standardSQL
SELECT a.*, b.* EXCEPT (name, timestamp)
FROM (
  SELECT 
    ANY_VALUE(a) a, 
    ARRAY_AGG(b ORDER BY b.timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] b
  FROM `project.dataset.table_a` a
  LEFT JOIN `project.dataset.table_b` b
  USING (name)
  WHERE a.timestamp > b.timestamp
  GROUP BY TO_JSON_STRING(a)
)
like image 40
Mikhail Berlyant Avatar answered Oct 19 '25 11:10

Mikhail Berlyant