 

Initial state for a dataflow job

I'm trying to figure out how to "seed" the window state for some of our streaming Dataflow jobs. The scenario: we have a stream of forum messages, and we want to emit a running count of messages for each topic for all time, so we have a streaming Dataflow job with a global window and triggers that emit each time a record for a topic comes in. All good so far. But prior to the stream source, we have a large file which we'd like to process to get our historical counts. Because topics live forever, the historical count needs to inform the outputs from the stream source, so we essentially need the same logic to run over the file, then start running over the stream source when the file is exhausted, while keeping the window state.

Current ideas:

  • Write a custom unbounded source that does just that: reads over the file until it's exhausted and then starts reading from the stream. Not appealing, because writing custom sources is not much fun.
  • Run the logic in batch mode over the file, and as the last step emit the state to a stream sink somehow, then have a streaming version of the logic start up that reads from both the state stream and the data stream and somehow combines the two. This seems to make some sense, but I'm not sure how to guarantee that the streaming job reads everything from the state source, to initialise, before reading from the data stream.
  • Pipe the historical data into a stream and write a job that reads from both streams. Same problem as the second option: I'm not sure how to ensure one stream is "consumed" first.

EDIT: The latest option, and what we're going with, is to write the calculation job so that it doesn't matter what order the events arrive in; then we can just push the archive to the pub/sub topic and it will all work. That works in this case, but it obviously affects the downstream consumer (it needs to support either updates or retractions), so I'd be interested to know what other solutions people have for seeding their window states.
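For reference, a minimal sketch of that order-insensitive approach, written against the Apache Beam Java API (the question predates Beam, so the original Dataflow SDK types differ slightly). The project name, topic names, and the "topicId,body" message format are placeholder assumptions:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class RunningTopicCounts {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    p.apply("ReadMessages",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/forum-messages"))
        // Assume each message looks like "topicId,body"; pull out the topic id.
        .apply("ExtractTopic",
            MapElements.into(TypeDescriptors.strings())
                .via((String msg) -> msg.split(",", 2)[0]))
        // Global window, fire after every element, accumulate panes: each pane
        // is the running count over everything seen so far, so the order in
        // which the backfilled archive and the live events arrive doesn't matter.
        .apply(Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .accumulatingFiredPanes()
            .withAllowedLateness(Duration.ZERO))
        .apply("CountPerTopic", Count.perElement())
        // Format KV<topic, count> for the downstream sink, which has to tolerate
        // repeated (updated) values for the same topic.
        .apply("Format",
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ":" + kv.getValue()))
        .apply("WriteCounts",
            PubsubIO.writeStrings().to("projects/my-project/topics/topic-counts"));

    p.run();
  }
}
```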

asked Feb 08 '16 by bfabry


1 Answer

You can do what you suggested in bullet point 2: run two pipelines (in the same main), with the first populating a Pub/Sub topic from the large file. This is similar to what the StreamingWordExtract example does.
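A rough sketch of that two-pipeline pattern, again in the Apache Beam Java API rather than the pre-Beam Dataflow SDK the answer refers to. The bucket, topic, and subscription names are placeholders, and the subscription is assumed to exist before the backfill runs so the archived messages are retained for the streaming job:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

public class SeedThenStream {
  public static void main(String[] args) {
    String topic = "projects/my-project/topics/forum-messages";
    String subscription = "projects/my-project/subscriptions/forum-messages-sub";

    // Pipeline 1: batch backfill. Read the historical archive from GCS and
    // publish every record to the Pub/Sub topic the streaming job consumes.
    PipelineOptions batchOptions = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline backfill = Pipeline.create(batchOptions);
    backfill
        .apply("ReadArchive", TextIO.read().from("gs://my-bucket/forum-archive/*.txt"))
        .apply("PublishArchive", PubsubIO.writeStrings().to(topic));
    backfill.run().waitUntilFinish(); // block until the whole archive is published

    // Pipeline 2: streaming job. Reads the pre-created subscription, which now
    // holds the archive followed by live traffic, and applies the counting
    // logic (e.g. the global-window/trigger pipeline sketched in the question).
    StreamingOptions streamOptions =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    streamOptions.setStreaming(true);
    Pipeline streaming = Pipeline.create(streamOptions);
    streaming.apply("ReadMessages", PubsubIO.readStrings().fromSubscription(subscription));
    // ... counting transforms go here ...
    streaming.run();
  }
}
```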

answered Oct 25 '22 by Tudor Marian