There has been a question on this topic, and the answer said: "The acknowledgement will be made once the message is durably persisted somewhere in the Dataflow pipeline."
Conceptually that makes sense, but I am not sure how Dataflow can track a message after it has been deserialized and transformed in the pipeline but before its payload is persisted.
In our case, the PubSub message contains a batch of items. After the message is received and deserialized, we break the batch down for processing. Eventually, each item in the batch is either discarded or committed to Datastore, depending on its timestamp.
How does the acknowledgement work in this situation?
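For concreteness, here is a minimal sketch of what our pipeline does (Apache Beam Python, streaming mode). The topic path and the `deserialize_batch` / `is_fresh` helpers are hypothetical stand-ins for our actual deserialization and timestamp logic:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def deserialize_batch(message_bytes):
    # One PubSub message carries a JSON-encoded batch of items;
    # fan the items out as individual pipeline elements.
    for item in json.loads(message_bytes.decode("utf-8")):
        yield item

def is_fresh(item, cutoff_ms=0):
    # Hypothetical timestamp check: keep only items newer than a cutoff.
    return item.get("timestamp_ms", 0) >= cutoff_ms

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
        | "FanOutBatch" >> beam.FlatMap(deserialize_batch)
        | "DropStale" >> beam.Filter(is_fresh)
        # ... a final step builds entities and commits them to Datastore.
    )
```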
Dataflow executes your code in bundles. After successful execution, each bundle is committed so that successfully processed elements are not re-executed. Bundles are not necessarily committed between every step in the pipeline; see the description of fusion optimization for details about when PCollections are materialized and committed.
For PubSub, messages that were read as part of a bundle will be acknowledged as part of committing the completion of that bundle. This means that if you look at the PubSub read step and any ParDos after it, these will be executed (and committed) together.

Adding a GroupByKey after the PubSub read allows messages to be acknowledged to PubSub as soon as the bundles are committed to the GroupByKey.
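Putting that together, one way to get earlier acknowledgements is to insert a Reshuffle (which is built on top of a GroupByKey) immediately after the read. A hedged sketch, reusing the hypothetical `deserialize_batch` helper and topic path from the earlier example:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def deserialize_batch(message_bytes):
    # Same hypothetical helper as above: fan out the batch into items.
    for item in json.loads(message_bytes.decode("utf-8")):
        yield item

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
        # Reshuffle is GroupByKey-based, so it breaks fusion and creates a
        # commit point: once the read bundle is committed into the shuffle,
        # the PubSub messages can be acknowledged, independent of how long
        # the downstream steps take.
        | "CommitPoint" >> beam.Reshuffle()
        | "FanOutBatch" >> beam.FlatMap(deserialize_batch)
        # ... remaining filtering and Datastore steps as before.
    )
```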