As I keep appending rows to BigQuery, I'd like to have a "materialized view" of the latest row for each id.
How can I do that?
2022 UPDATE: the answer might be OUTDATED given it was given before Materialized view release in 2018 https://cloud.google.com/bigquery/docs/materialized-views-intro
The following solution involves Scheduled Queries
, still working as of 2022-03
2018-10: BigQuery doesn't support materialized views, but you can use this approach:
Let's say you want a table with the latest info for each row, and you want to keep it updated - so anyone querying can easily access the latest row without having to scan the whole append-only table.
For this example I'll use my Wikipedia clustered logs - and I'll create a table with the latest rows of all English pages that start with 'A'. These restrictions make my queries faster and smaller for this demo purposes.
Let's create the table first:
CREATE TABLE `wikipedia_vt.just_latest_rows` AS
SELECT latest_row.*
FROM (
SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
WHERE datehour BETWEEN "2018-10-18" AND "2018-10-21"
AND wiki='en' AND title LIKE 'A%'
GROUP BY title
)
And now I want to update it with all the new rows received since that date:
MERGE `wikipedia_vt.just_latest_rows` T
# our "materialized view"
USING (
SELECT latest_row.*
FROM (
SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY )
# change to CURRENT_TIMESTAMP() or let scheduled queries do it
AND datehour > '2000-01-01' # nag
AND wiki='en' AND title LIKE 'A%'
GROUP BY title
)
) S
ON T.title = S.title
WHEN MATCHED THEN
# if the row is there, we update the views and time
UPDATE SET views = S.views, datehour=S.datehour
WHEN NOT MATCHED BY TARGET THEN
# if the row is not there, we insert it
INSERT (datehour, wiki, title, views) VALUES (datehour, wiki, title, views)
Now you should set up a process to run this query periodically. To keep querying costs down, make sure the process changes the starting date for updates.
A simple way to set up this process is to use the new BigQuery Scheduled Queries, which will replace @run_time with the current timestamp.
To create a view that combines this approach with a real-time view of the latest records, see:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With