 

BigQuery: how to delete records from Dataflow

I need to update and delete records in BigQuery from a Dataflow pipeline. The data comes from Pub/Sub with a flag that identifies the action: Insert, Update, or Delete (I, U, D). Inserts are not a problem.

Any suggestions for updates and deletes?

asked Dec 18 '25 by user3066281

1 Answer

The IOs in Dataflow prefer to do blind writes (ideally to a separate table) so that the pipeline's input is preserved. This gives a more stable, high-performance experience and avoids mutating the data the pipeline reads from.

Inserts, of course, can be performed simply using BigQueryIO.Write.

For updates, if you simply want to overwrite a previous row in the table, you can just write a TableRow using BigQueryIO.Write.
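For illustration, here is a minimal sketch of that blind-write pattern with the Beam Java SDK. The Pub/Sub topic, table name, message format, and field names are all assumptions, not something from the question:

```java
// Sketch: append every incoming event as a TableRow, regardless of whether it
// represents an insert or an overwrite-style update. All names are placeholders.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class BlindWritePipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadPubsub", PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/my-topic"))   // assumed topic
        // Messages are assumed to be "action,key,value" CSV; adapt to your format.
        .apply("ToTableRow", MapElements
            .into(TypeDescriptor.of(TableRow.class))
            .via((String msg) -> {
              String[] parts = msg.split(",", 3);
              return new TableRow()
                  .set("action", parts[0])   // I / U / D flag from Pub/Sub
                  .set("key", parts[1])
                  .set("value", parts[2]);
            }))
        .apply("WriteToBQ", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")                // assumed table
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }
}
```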

If you want to read, modify, and write a row (e.g. increment the value in a row by X), then this will involve reading in the data as well, which is more complicated. Here are a few approaches for this. (For these purposes, you can also think of a delete as a kind of update, i.e. a row may be added to indicate that all entries for that "key" should be discarded.)

Approach 1 Write every update to its own row

See the answer linked below: you simply write every update (without a read) as its own row and compute the aggregated value when you run the SQL query. Depending on the kinds of updates you need, this can perform well at query time.

Google Dataflow: insert + update in BigQuery in a streaming pipeline
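As a rough illustration of the query-time side of this approach (not taken from the linked answer), here is a sketch using the BigQuery Java client. The table, column names, and the ROW_NUMBER query are assumptions: it resolves the latest appended row per key and drops keys whose most recent action is a delete.

```java
// Sketch: query a journal-style table where every event was appended as its own row,
// and compute the current state per key at read time. All names are placeholders.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class LatestStateQuery {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    String sql =
        "SELECT * EXCEPT(rn) FROM ("
            + " SELECT *, ROW_NUMBER() OVER (PARTITION BY key ORDER BY event_ts DESC) AS rn"
            + " FROM `my_dataset.events_table`)"        // assumed journal-style table
            + " WHERE rn = 1 AND action != 'D'";        // latest row per key, minus deletes

    TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
    for (FieldValueList row : result.iterateAll()) {
      System.out.printf("key=%s value=%s%n",
          row.get("key").getStringValue(), row.get("value").getStringValue());
    }
  }
}
```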

Approach 2 Use Dataflow Streaming State

With this approach, you can create a stateful ParDo, which essentially allows you to store some persistent state within Dataflow for every key, for a particular ParDo. You could keep a value there, keep updating it based on your Pub/Sub events, and emit an element that overwrites a row in an output BigQuery table with a new TableRow.
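A minimal sketch of such a stateful DoFn follows. The class name, message format ("U|newValue" or "D" per key), and output columns are assumptions; the input must already be keyed as KV pairs for state to work:

```java
// Sketch: keep the latest value per key in Dataflow streaming state, apply I/U/D
// events, and emit a TableRow that overwrites the row for that key downstream.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class ApplyEventFn extends DoFn<KV<String, String>, TableRow> {

  // Per-key persistent value, stored in Dataflow streaming state.
  @StateId("current")
  private final StateSpec<ValueState<String>> currentSpec = StateSpecs.value();

  @ProcessElement
  public void processElement(ProcessContext c,
                             @StateId("current") ValueState<String> current) {
    String key = c.element().getKey();
    String[] parts = c.element().getValue().split("\\|", 2);  // "I|v", "U|v" or "D"

    if ("D".equals(parts[0])) {
      current.clear();                         // delete: drop the stored value
      c.output(new TableRow().set("key", key).set("deleted", true));
    } else {                                   // I or U: overwrite the stored value
      current.write(parts[1]);
      c.output(new TableRow().set("key", key).set("value", parts[1]).set("deleted", false));
    }
  }
}
```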

If you have an existing BQ table, you may need to initialize the stateful ParDo using both BigQueryIO.Read and PubsubIO.Read.

Be aware that there is some pricing associated with Dataflow streaming state when using this approach.

When you update a Dataflow pipeline, this state is also retained.

Approach 3 Write a journal table and compute the final table you want

In this approach, you write an entry to a table for every event, regardless of whether it is an Insert, Update, or Delete.

Then have another pipeline read in that BigQuery table and aggregate all rows with the same "key". You will need to use a window to aggregate the data, so you can compute an up-to-date entry for all the events for a key within the same hour, day, etc., using that window.
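Here is a minimal sketch of that second pipeline. The table, column names, and the assumption that event_ts is an ISO-8601 timestamp string are all placeholders, not something prescribed by Beam:

```java
// Sketch: read the journal table, window by event time, keep only the newest entry
// per key, drop keys whose newest action is a delete, and write the compacted result.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class JournalCompactionPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadJournal", BigQueryIO.readTableRows()
            .from("my-project:my_dataset.journal_table"))            // assumed journal table
        // event_ts is assumed to be an ISO-8601 timestamp string on every row.
        .apply("AssignEventTime", WithTimestamps.of(
            (TableRow row) -> Instant.parse((String) row.get("event_ts"))))
        .apply("KeyByEntity", MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(), TypeDescriptor.of(TableRow.class)))
            .via((TableRow row) -> KV.of((String) row.get("key"), row)))
        .apply("HourlyWindow",
            Window.<KV<String, TableRow>>into(FixedWindows.of(Duration.standardHours(1))))
        .apply("GroupByEntity", GroupByKey.<String, TableRow>create())
        .apply("PickLatest", ParDo.of(new DoFn<KV<String, Iterable<TableRow>>, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            TableRow latest = null;
            for (TableRow r : c.element().getValue()) {   // newest event_ts wins
              if (latest == null || ((String) r.get("event_ts"))
                  .compareTo((String) latest.get("event_ts")) > 0) {
                latest = r;
              }
            }
            // A newest action of "D" means the key was deleted; emit nothing for it.
            if (latest != null && !"D".equals(latest.get("action"))) {
              c.output(latest);
            }
          }
        }))
        .apply("WriteCompacted", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.compacted_table")             // assumed output table
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

    p.run();
  }
}
```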

Approach 4 Use BigQuery APIs directly (Not recommended)

Write a custom ParDo that manually reads, modifies, and writes each BigQuery update using the BigQuery APIs directly from the ParDo (instead of using BigQueryIO).
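For completeness, a minimal sketch of what such a DoFn might look like with the google-cloud-bigquery client. The table and column names, and the "U|newValue" / "D" element format, are assumptions; see the caveats below before using anything like this:

```java
// Sketch (NOT recommended): issue one BigQuery DML statement per element.
// Statements are not idempotent, so Dataflow retries can apply an update twice,
// and DML cannot touch rows still in the streaming buffer.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class DirectDmlFn extends DoFn<KV<String, String>, Void> {
  private transient BigQuery bigquery;

  @Setup
  public void setup() {
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    String key = c.element().getKey();
    String[] parts = c.element().getValue().split("\\|", 2);   // "U|newValue" or "D"
    if ("I".equals(parts[0])) {
      return;  // inserts are assumed to go through BigQueryIO.Write as described above
    }
    boolean isDelete = "D".equals(parts[0]);

    String sql = isDelete
        ? "DELETE FROM `my_dataset.my_table` WHERE key = @key"
        : "UPDATE `my_dataset.my_table` SET value = @value WHERE key = @key";

    QueryJobConfiguration.Builder query = QueryJobConfiguration.newBuilder(sql)
        .addNamedParameter("key", QueryParameterValue.string(key));
    if (!isDelete) {
      query.addNamedParameter("value", QueryParameterValue.string(parts[1]));
    }
    bigquery.query(query.build());   // one synchronous RPC round trip per element
  }
}
```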

This is not recommended because making the RPCs to read, modify, and write each row will be very slow due to the RPC round-trip time. It can also produce unwanted side effects: if you end up reprocessing data on Dataflow's retries, the updates could be applied multiple times. But it could work if you are okay with some inaccuracy and have a low volume of data.

An alternative to this is to use a ParDo with streaming state (Approach 2 above), which lets you update the value for a key inside the Dataflow pipeline.

answered Dec 21 '25 by Alex Amato


