We are using the following code to write the records to BigQuery:
BigQueryIO.writeTableRows()
.to("table")
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withSchema(schema);
With this code, when we do a backfill, some records get sent through this Dataflow pipeline again, resulting in duplicates in the BigQuery table. Is there any way to configure an upsert operation, keyed on a field, from within the Dataflow pipeline?
I had a very similar use case, and we solved it by creating a view on top of the table that deduplicates the data, then pointing anything that had been querying the original table at this view instead.
"BigQuery deduplication and partitioned table" is a very good reference, as is https://wecode.wepay.com/posts/bigquery-wepay.
Essentially, you need some sort of last_updated column, as well as an id column that uniquely identifies each row. You can then create a view that returns only the most recent version of each id. There may be a small performance cost because you are querying a view, but we found it to be negligible.
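For reference, here is a minimal sketch of such a view in BigQuery Standard SQL; the table, view, and column names (dataset.events, dataset.events_deduped, id, last_updated) are placeholders for illustration:

-- Keep only the most recent row per id (all names below are placeholders).
CREATE OR REPLACE VIEW `dataset.events_deduped` AS
SELECT
  * EXCEPT (row_num)
FROM (
  SELECT
    *,
    -- Rank each id's rows, newest last_updated first.
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_updated DESC) AS row_num
  FROM `dataset.events`
)
WHERE row_num = 1;

Queries that used to hit the base table can select from the view unchanged; deduplication happens at read time, so the append-only Dataflow write path stays as it is.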