Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

I want a "materialized view" of the latest records

As I keep appending rows to BigQuery, I'd like to have a "materialized view" of the latest row for each id.

How can I do that?

like image 529
Felipe Hoffa Avatar asked Oct 25 '18 16:10

Felipe Hoffa


1 Answers

2022 UPDATE: the answer might be OUTDATED given it was given before Materialized view release in 2018 https://cloud.google.com/bigquery/docs/materialized-views-intro The following solution involves Scheduled Queries, still working as of 2022-03

2018-10: BigQuery doesn't support materialized views, but you can use this approach:

Let's say you want a table with the latest info for each row, and you want to keep it updated - so anyone querying can easily access the latest row without having to scan the whole append-only table.

For this example I'll use my Wikipedia clustered logs - and I'll create a table with the latest rows of all English pages that start with 'A'. These restrictions make my queries faster and smaller for this demo purposes.

Let's create the table first:

CREATE TABLE `wikipedia_vt.just_latest_rows` AS
SELECT latest_row.* 
FROM (
  SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
  WHERE datehour BETWEEN "2018-10-18" AND "2018-10-21" 
  AND wiki='en' AND title LIKE 'A%'
  GROUP BY title
)

And now I want to update it with all the new rows received since that date:

MERGE `wikipedia_vt.just_latest_rows` T
# our "materialized view"
USING  (
  SELECT latest_row.* 
  FROM (
    SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
    FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
    WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY )
    # change to CURRENT_TIMESTAMP() or let scheduled queries do it
    AND datehour > '2000-01-01' # nag
    AND wiki='en' AND title LIKE 'A%'
    GROUP BY title
  )
) S
ON T.title = S.title

WHEN MATCHED THEN
  # if the row is there, we update the views and time
  UPDATE SET views = S.views, datehour=S.datehour

WHEN NOT MATCHED BY TARGET THEN
  # if the row is not there, we insert it 
  INSERT (datehour, wiki, title, views) VALUES (datehour, wiki, title, views)

Now you should set up a process to run this query periodically. To keep querying costs down, make sure the process changes the starting date for updates.

A simple way to set up this process is to use the new BigQuery Scheduled Queries, which will replace @run_time with the current timestamp.

To create a view that combines this approach with a real-time view of the latest records, see:

  • I want "live materialized views", with the latest info for any row

enter image description here

like image 68
Felipe Hoffa Avatar answered Nov 13 '22 03:11

Felipe Hoffa