 

Initial state for a dataflow job

I'm trying to figure out how to "seed" the window state for some of our streaming Dataflow jobs. The scenario: we have a stream of forum messages, and we want to emit a running count of messages for each topic for all time, so we have a streaming Dataflow job with a global window and triggers that emit each time a record for a topic comes in. All good so far. But prior to the stream source, we have a large file which we'd like to process to get our historical counts. Because topics live forever, the historical count needs to inform the outputs from the stream source, so we essentially need the same logic to run over the file, then start running over the stream source when the file is exhausted, while keeping the window state.

Current ideas:

  • Write a custom unbounded source that does just that: reads over the file until it's exhausted and then starts reading from the stream. Not appealing, because writing custom sources is not much fun.
  • Run the logic in batch mode over the file, and as the last step emit the state to a stream sink somehow, then have a streaming version of the logic start up that reads from both the state stream and the data stream and somehow combines the two. This seems to make some sense, but I'm not sure how to guarantee that the streaming job reads everything from the state source, to initialise, before reading from the data stream.
  • Pipe the historical data into a stream and write a job that reads from both streams. Same problem as the second option: I'm not sure how to ensure one stream is "consumed" first.

EDIT: The latest option, and what we're going with, is to write the calculation job so that it doesn't matter what order the events arrive in; then we can just push the archive to the pub/sub topic and it will all work. That works in this case, but it obviously affects the downstream consumer (it needs to support either updates or retractions), so I'd be interested to know what other solutions people have for seeding their window states.
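For reference, a minimal sketch of that order-insensitive approach, written against the Apache Beam Java API (the question predates Beam, so the original Dataflow SDK types differ slightly). The project name, topic names, and the "topicId,body" message format are placeholder assumptions:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class RunningTopicCounts {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    p.apply("ReadMessages",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/forum-messages"))
        // Assume each message looks like "topicId,body"; pull out the topic id.
        .apply("ExtractTopic",
            MapElements.into(TypeDescriptors.strings())
                .via((String msg) -> msg.split(",", 2)[0]))
        // Global window, fire after every element, accumulate panes: each pane
        // is the running count over everything seen so far, so the order in
        // which the backfilled archive and the live events arrive doesn't matter.
        .apply(Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .accumulatingFiredPanes()
            .withAllowedLateness(Duration.ZERO))
        .apply("CountPerTopic", Count.perElement())
        // Format KV<topic, count> for the downstream sink, which has to tolerate
        // repeated (updated) values for the same topic.
        .apply("Format",
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ":" + kv.getValue()))
        .apply("WriteCounts",
            PubsubIO.writeStrings().to("projects/my-project/topics/topic-counts"));

    p.run();
  }
}
```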

asked Feb 08 '16 by bfabry


1 Answer

You can do what you suggested in bullet point 2: run two pipelines (in the same main), with the first populating a Pub/Sub topic from the large file. This is similar to what the StreamingWordExtract example does.
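A rough sketch of that two-pipeline pattern, again in the Apache Beam Java API rather than the pre-Beam Dataflow SDK the answer refers to. The bucket, topic, and subscription names are placeholders, and the subscription is assumed to exist before the backfill runs so the archived messages are retained for the streaming job:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

public class SeedThenStream {
  public static void main(String[] args) {
    String topic = "projects/my-project/topics/forum-messages";
    String subscription = "projects/my-project/subscriptions/forum-messages-sub";

    // Pipeline 1: batch backfill. Read the historical archive from GCS and
    // publish every record to the Pub/Sub topic the streaming job consumes.
    PipelineOptions batchOptions = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline backfill = Pipeline.create(batchOptions);
    backfill
        .apply("ReadArchive", TextIO.read().from("gs://my-bucket/forum-archive/*.txt"))
        .apply("PublishArchive", PubsubIO.writeStrings().to(topic));
    backfill.run().waitUntilFinish(); // block until the whole archive is published

    // Pipeline 2: streaming job. Reads the pre-created subscription, which now
    // holds the archive followed by live traffic, and applies the counting
    // logic (e.g. the global-window/trigger pipeline sketched in the question).
    StreamingOptions streamOptions =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    streamOptions.setStreaming(true);
    Pipeline streaming = Pipeline.create(streamOptions);
    streaming.apply("ReadMessages", PubsubIO.readStrings().fromSubscription(subscription));
    // ... counting transforms go here ...
    streaming.run();
  }
}
```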

answered Oct 25 '22 by Tudor Marian