
What does reshuffling, in the context of exactly-once processing in the BigQuery sink, mean?

I'm reading an article on exactly-once processing implemented by some Dataflow sources and sinks and I'm having trouble understanding the example on the BigQuery sink. From the article:

Generating a random UUID is a non-deterministic operation, so we must add a reshuffle before we insert into BigQuery. Once that is done, any retries by Cloud Dataflow will always use the same UUID that was shuffled. Duplicate attempts to insert into BigQuery will always have the same insert id, so BigQuery is able to filter them.

// Apply a unique identifier to each record.
// (Record and RecordWithId stand for the article's placeholder element types.)
c
 .apply(ParDo.of(new DoFn<Record, KV<Integer, RecordWithId>>() {
   @ProcessElement
   public void processElement(ProcessContext context) {
     String uniqueId = UUID.randomUUID().toString();
     // Key by a random shard (0-49) so the reshuffle spreads load evenly.
     context.output(KV.of(ThreadLocalRandom.current().nextInt(0, 50),
         new RecordWithId(context.element(), uniqueId)));
   }
 }))
 // Reshuffle the data so that the applied identifiers are stable and will not change.
 .apply(Reshuffle.<Integer, RecordWithId>of())
 // Stream records into BigQuery with unique ids for deduplication.
 .apply(ParDo.of(new DoFn<KV<Integer, RecordWithId>, Void>() {
   @ProcessElement
   public void processElement(ProcessContext context) {
     insertIntoBigQuery(context.element().getValue().record(),
         context.element().getValue().id());
   }
 }));

What does reshuffle mean, and how can it prevent the generation of a different UUID for the same insert on subsequent retries?

MassyB asked Sep 26 '18



2 Answers

Reshuffle groups the data in a different way. However, here it is used for its side-effects: checkpointing and deduplication.

Without the reshuffle, if the same task generates the UUID and inserts the data into BigQuery, there is a risk that the worker restarts and the new worker generates a new UUID and sends a different row to BigQuery, resulting in duplicate rows.
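The failure mode can be sketched in plain Java without Beam (the class and method names here are illustrative, not from the article): because the id is generated inside the same attempt that performs the insert, a retried attempt produces a different id for the same logical record.

```java
import java.util.UUID;

public class RetryWithoutReshuffle {

    // Models one attempt of a fused "generate id + insert" stage: the insert id
    // is created inside the same attempt that writes to the sink.
    static String attemptInsert() {
        // Non-deterministic: a new value on every attempt.
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // First attempt inserts, then the worker crashes before committing.
        String firstId = attemptInsert();
        // Dataflow retries the bundle; the id is regenerated.
        String retryId = attemptInsert();
        // The ids differ, so BigQuery cannot recognize the retry as a
        // duplicate and would keep both rows.
        System.out.println(firstId.equals(retryId)); // false
    }
}
```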

The reshuffle operation splits UUID generation and the BigQuery insert into two steps, and inserts checkpointing and deduplication between them:

  1. First, UUIDs are generated and sent to the reshuffle. If the UUID-generating worker is restarted, that is fine: the reshuffle deduplicates rows, eliminating data from failed or restarted workers.
  2. The generated UUIDs are checkpointed by the shuffle operation.
  3. The BigQuery insert worker uses the checkpointed UUIDs, so even if it is restarted, it sends exactly the same data to BigQuery.
  4. BigQuery deduplicates the data using these UUIDs, so duplicates from a restarted insert worker are eliminated in BigQuery.
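The steps above can be simulated in plain Java (a sketch with made-up names, not Beam or BigQuery API: the first map stands in for the shuffle checkpoint, the second for BigQuery's insert-id-based deduplication table):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class ReshuffleSketch {

    // Steps 1-2: UUIDs are generated once and checkpointed by the shuffle,
    // so any retry of the insert stage sees the same ids.
    static Map<String, String> checkpoint(List<String> records) {
        Map<String, String> ids = new LinkedHashMap<>(); // record -> insert id
        for (String r : records) {
            ids.put(r, UUID.randomUUID().toString());
        }
        return ids;
    }

    // Steps 3-4: the insert worker may run more than once, but the sink
    // deduplicates on insert id, so each row is stored exactly once.
    static Map<String, String> insertWithRetries(Map<String, String> checkpointed,
                                                 int attempts) {
        Map<String, String> table = new LinkedHashMap<>(); // insert id -> row
        for (int attempt = 0; attempt < attempts; attempt++) {
            for (Map.Entry<String, String> e : checkpointed.entrySet()) {
                table.putIfAbsent(e.getValue(), e.getKey()); // dedup by id
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> ids = checkpoint(List.of("e0", "e1", "e2"));
        // The insert worker is restarted and replays the whole bundle twice...
        Map<String, String> table = insertWithRetries(ids, 2);
        // ...yet every record lands exactly once.
        System.out.println(table.size()); // 3
    }
}
```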
Michael Entin answered Oct 21 '22


I think the article provides a good explanation of why "reshuffle" helps move from "at least once" to "exactly once":

Specifically, the window might attempt to fire with element e0, e1, e2, but the worker crashes before committing the window processing (but not before those elements are sent as a side effect). When the worker restarts the window will fire again, but now a late element e3 shows up. Since this element shows up before the window is committed, it’s not counted as late data, so the DoFn is called again with elements e0, e1, e2, e3. These are then sent to the side-effect operation. Idempotency does not help here, as different logical record sets were sent each time.

There are other ways non-determinism can be introduced. The standard way to address this risk is to rely on the fact that Cloud Dataflow currently guarantees that only one version of a DoFn's output can make it past a shuffle boundary.

You can also check Reshuffle's docs:

  • https://beam.apache.org/documentation/sdks/javadoc/2.3.0/org/apache/beam/sdk/transforms/Reshuffle.html

There's a note there about deprecating this class, so future implementations of BigQueryIO might differ.

Felipe Hoffa answered Oct 21 '22