We are using the following code to write the records to BigQuery:
BigQueryIO.writeTableRows()
.to("table")
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withSchema(schema);
With this code, when we do a backfill, some records get sent through this Dataflow pipeline again, resulting in duplicates in the BigQuery table. Is there any way to configure an upsert operation, keyed on a field, from within the Dataflow pipeline?
I had a very similar use case, and we solved it by creating a view on top of the table that deduplicates the data, then pointing anything that had been querying the original table at this view instead.
"BigQuery deduplication and partitioned table" is a very good reference, as is https://wecode.wepay.com/posts/bigquery-wepay.
Essentially, you need some sort of last_updated column, as well as an id column that uniquely identifies each row. You can then create a view that returns only the most recent version of each id. There may be a small performance cost because you are querying a view, but we found it to be negligible.
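For reference, here is a minimal sketch of such a view in BigQuery Standard SQL; the table, view, and column names (dataset.events, dataset.events_deduped, id, last_updated) are placeholders for illustration:

-- Keep only the most recent row per id (all names below are placeholders).
CREATE OR REPLACE VIEW `dataset.events_deduped` AS
SELECT
  * EXCEPT (row_num)
FROM (
  SELECT
    *,
    -- Rank each id's rows, newest last_updated first.
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_updated DESC) AS row_num
  FROM `dataset.events`
)
WHERE row_num = 1;

Queries that used to hit the base table can select from the view unchanged; deduplication happens at read time, so the append-only Dataflow write path stays as it is.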