Apache Beam: Update BigQuery table row with BigQueryIO

We are using the following code to write the records to BigQuery:

// Writes each TableRow to BigQuery, creating the table if it does not
// exist and appending on every run (the appends are what produce the
// duplicates during a backfill).
BigQueryIO.writeTableRows()
    .to("table")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withSchema(schema);

With this code, when we do a backfill, some of the records are sent through the dataflow again, resulting in duplicates in the BigQuery table. Is there any way to configure an upsert operation, keyed on a field name, in the dataflow?

asked Mar 08 '23 by Darshan Mehta
1 Answer

I had a very similar use case. We solved it by creating a view on top of the table that deduplicates the data, and then pointing anything that had been accessing the original table at the view instead.

"BigQuery deduplication and partitioned table" is a very good reference, as is https://wecode.wepay.com/posts/bigquery-wepay.

Essentially, you need to make sure you have some sort of last_updated column, as well as an id column that uniquely identifies each row. Then you can create a view that returns only the most recent version of each id. There may be a small performance cost because you are querying a view rather than the table, but we found it to be negligible.
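As a sketch of what that view can look like in BigQuery standard SQL (the project, dataset, table, and column names here are placeholders, not from the question): a ROW_NUMBER() window partitioned by the unique id and ordered by last_updated picks the newest copy of each row, and the view filters to that copy.

```sql
-- Deduplicating view: for each id, keep only the row with the
-- most recent last_updated value. Names are illustrative.
CREATE OR REPLACE VIEW `my_project.my_dataset.table_latest` AS
SELECT
  * EXCEPT (rn)  -- drop the helper column from the view's output
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY last_updated DESC
    ) AS rn
  FROM
    `my_project.my_dataset.table`
)
WHERE rn = 1;
```

Consumers then query `table_latest` instead of the raw table, so backfill duplicates written by the WRITE_APPEND pipeline never surface in results.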

answered Mar 20 '23 by gilmatic