I am trying to create a streaming Dataflow pipeline. I need to access some additional data while processing the main input received from Pub/Sub. If I use side inputs, I need to update the cached side input data after some time period (for example, one day). Alternatively, is there any way to pass additional data to transformations in the form of a list or map, so that I can use a third-party cache manager to refresh this data?
Thanks.
Here are some possibilities:
If you can express the changes to your side input as an unbounded PCollection of changes - for example by subscribing to a change notification topic - then you should be able to add the updates to the PCollection that you are viewing as a side input, as long as you are tolerant of stale data. Side input updates and caching have no defined latency, but the delay will certainly be less than a day.
If you do not have a change notification topic, you can still write your own UnboundedSource that Dataflow will poll for updates. This can be a bit more complex.
Similar to an UnboundedSource, in Apache Beam (incubating) there is active work on supporting timers with callbacks in a DoFn, so you might like to follow BEAM-27.
You could also drive your own cache on the incoming main input data.
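To make the last option concrete, here is a minimal sketch of a cache that is refreshed by the arrival of main input elements. It is plain Python rather than Dataflow or Beam API code, and every name in it (`RefreshingLookup`, `enrich`, the `loader` callable, `ttl_seconds`) is hypothetical. It assumes the additional data fits in memory and can be reloaded in full by a single function call.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class RefreshingLookup:
    """Holds a lookup table and reloads it once it is older than ttl_seconds.

    Hypothetical helper: `loader` stands in for whatever fetches the
    additional data (database query, file read, HTTP call, ...).
    """

    def __init__(
        self,
        loader: Callable[[], Dict[str, Any]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the refresh logic can be tested
        self._data: Optional[Dict[str, Any]] = None
        self._loaded_at = 0.0

    def get(self) -> Dict[str, Any]:
        """Return the cached table, reloading it first if it has expired."""
        now = self._clock()
        if self._data is None or now - self._loaded_at >= self._ttl:
            self._data = self._loader()
            self._loaded_at = now
        return self._data


def enrich(element: str, lookup: RefreshingLookup) -> Tuple[str, Any]:
    """Pair a main input element with its value from the cached table.

    The refresh check runs on every call, so the cache is only refreshed
    when main input actually arrives.
    """
    table = lookup.get()
    return (element, table.get(element))
```

In a real pipeline an object like this would probably be created once per worker instance inside the transformation and called for each element, for example with `ttl_seconds=24 * 60 * 60` for a daily refresh. Each worker would then hold and reload its own copy, so different workers may briefly see different versions of the data.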
Depending on other details, there may be other approaches. You may need to think about how to "wait" for the new value to be ready. For a particular window, any main input will wait for the side input to have a value for that window. Once the side input has one value, however, there is no more waiting, just a best-effort eventual update.