I'm trying to figure out how to "seed" the window state for some of our streaming Dataflow jobs. The scenario: we have a stream of forum messages, and we want to emit a running count of messages for each topic over all time, so we have a streaming Dataflow job with a global window and a trigger that fires each time a record for a topic comes in. All good so far. But prior to the stream source we have a large file which we'd like to process to get our historical counts. Also, because topics live forever, we need the historical counts to inform the outputs from the stream source. So we essentially need the same logic to run over the file, then start running over the stream source when the file is exhausted, all while keeping the window state.
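For concreteness, here's a minimal Beam (Java) sketch of the streaming job described above: a global window with a per-element trigger and accumulating panes, so every firing re-emits the running total for a topic rather than a delta. The topic names, the "topic,body" payload format, and the class name are all assumptions for illustration.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class RunningTopicCounts {
  public static void main(String[] args) {
    // Run with --streaming=true on the Dataflow runner.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadMessages", PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/forum-messages")) // hypothetical topic
     .apply("ExtractTopic", MapElements
            .into(TypeDescriptors.strings())
            .via((String line) -> line.split(",", 2)[0])) // assumes "topic,body" payloads
     // Global window, fire on every element, and accumulate fired panes so
     // each firing emits the count over all elements seen so far.
     .apply("AllTimeWindow", Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .accumulatingFiredPanes()
            .withAllowedLateness(Duration.ZERO))
     .apply("CountPerTopic", Count.perElement())
     .apply("FormatOutput", MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply("PublishCounts", PubsubIO.writeStrings()
            .to("projects/my-project/topics/topic-counts")); // hypothetical topic

    p.run();
  }
}
```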
Current ideas:
EDIT: The latest option, and what we're going with, is to write the calculation job such that it doesn't matter at all what order the events arrive in, so we'll just push the archive to the Pub/Sub topic and it will all work. That works in this case, but it obviously affects the downstream consumer (it needs to support either updates or retractions), so I'd be interested to know what other solutions people have for seeding their window states.
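Because the accumulating trigger re-emits the full running total each time, "supporting updates" downstream can be as simple as an idempotent upsert keyed by topic. A hedged sketch using the google-cloud-pubsub client library follows; the subscription name and the "topic: count" payload format are assumptions matching the job sketched above.

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CountConsumer {
  public static void main(String[] args) {
    Map<String, Long> latestCounts = new ConcurrentHashMap<>();
    ProjectSubscriptionName sub =
        ProjectSubscriptionName.of("my-project", "topic-counts-sub"); // hypothetical

    MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
      // Payloads look like "topic: count". Each pane carries the full total,
      // so overwriting the previous value per key is the whole "update".
      String[] parts = message.getData().toStringUtf8().split(": ", 2);
      latestCounts.put(parts[0], Long.parseLong(parts[1]));
      consumer.ack();
    };

    Subscriber subscriber = Subscriber.newBuilder(sub, receiver).build();
    subscriber.startAsync().awaitRunning();
  }
}
```

One caveat: Pub/Sub doesn't guarantee delivery order, so a plain overwrite can regress if an older pane arrives after a newer one; a production consumer would want something like a pane index or timestamp attribute to make "latest wins" safe.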
You can do what you suggested in bullet point 2: run two pipelines (in the same main), the first of which populates a Pub/Sub topic from the large file. This is similar to what the StreamingWordExtract example does.
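A rough sketch of that shape, assuming a hypothetical archive path and topic. Note that PubsubIO writes from a batch pipeline aren't supported on every runner; where they aren't, the usual substitute is a ParDo that publishes via the Pub/Sub client library.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SeedThenStream {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Pipeline 1: inject the historical archive into the same Pub/Sub topic
    // the streaming job reads from, and block until it has drained the file.
    Pipeline injector = Pipeline.create(options);
    injector.apply("ReadArchive", TextIO.read().from("gs://my-bucket/archive/*.txt")) // hypothetical path
            .apply("PublishArchive", PubsubIO.writeStrings()
                .to("projects/my-project/topics/forum-messages")); // hypothetical topic
    injector.run().waitUntilFinish();

    // Pipeline 2: the streaming counting job (as sketched in the question),
    // which picks up live traffic on the same topic with its window state
    // already fed by the archive messages.
    Pipeline streaming = Pipeline.create(options);
    // ... build the global-window counting pipeline here ...
    streaming.run();
  }
}
```

One wrinkle: Pub/Sub only retains messages for subscriptions that already exist, so the streaming job's subscription must exist before the injector runs (for example, pre-create it and read with PubsubIO.readStrings().fromSubscription(...)); otherwise the archive messages are dropped before anything reads them.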