I am trying to get some data into Dataflow, but the data is not located on Cloud Storage. It is an RSS feed that I would normally check every x hours. Is there a way to do that directly using the SDK, or do I have to get the files onto Cloud Storage some other way first?
Thanks in advance.
Dataflow doesn't provide a source for an RSS feed.
You could issue HTTP requests from a ParDo to fetch the data, though. For example, suppose the feed allowed you to fetch messages in some time range. Then you could create an input collection where each record represents a range of time (e.g. an hour). You could then write a ParDo that fetches the messages in each time range and emits them.
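To make the time-range idea concrete, here is a rough sketch in plain Python (not Dataflow SDK code). The names `hourly_ranges`, `items_in_range` and `fetch_range` are made up for illustration, and it assumes a standard RSS 2.0 feed whose items carry a `pubDate`. If your feed can't be queried by time range, the sketch simply downloads the whole feed and filters it locally. The body of `fetch_range` is roughly what the ParDo would run for each time-range record.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime
from urllib.request import urlopen
import xml.etree.ElementTree as ET


def hourly_ranges(start, end):
    """Yield (range_start, range_end) pairs covering [start, end) in one-hour steps.

    These pairs play the role of the records in the input collection.
    """
    current = start
    while current < end:
        nxt = min(current + timedelta(hours=1), end)
        yield (current, nxt)
        current = nxt


def items_in_range(rss_xml, time_range):
    """Return (title, link) for every <item> whose pubDate falls in time_range.

    Assumes time_range holds timezone-aware datetimes and that pubDate
    values include a timezone, as RSS 2.0 dates normally do.
    """
    range_start, range_end = time_range
    results = []
    for item in ET.fromstring(rss_xml).iter("item"):
        pub_date = item.findtext("pubDate")
        if pub_date is None:
            continue  # no date, so the item cannot be assigned to a range
        published = parsedate_to_datetime(pub_date)
        if range_start <= published < range_end:
            results.append((item.findtext("title"), item.findtext("link")))
    return results


def fetch_range(feed_url, time_range):
    """Per-element work: download the feed and keep only items in time_range."""
    with urlopen(feed_url, timeout=30) as response:
        return items_in_range(response.read(), time_range)
```

Note that with this approach every time-range record triggers its own HTTP request, so keep the number of ranges reasonable for the feed you are hitting.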
If you are part of the streaming early access preview, then one solution would be to write an App Engine app (or equivalent) that checks the RSS feed every X hours and publishes the data using Google Cloud Pub/Sub. You could then use PubsubIO to read those events in Dataflow.
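For the polling side, something like the following could work as a starting point. It is only a sketch: `publish_new_items` is a hypothetical helper, and `publish` is a placeholder for whatever callable you wrap around your Pub/Sub client, so no real Pub/Sub API is shown here. It also assumes items can be identified by their `guid` (falling back to `link`), so that polling every X hours does not republish items already sent.

```python
import xml.etree.ElementTree as ET


def publish_new_items(rss_xml, seen_ids, publish):
    """Publish every feed item not seen on an earlier poll; return how many were sent.

    rss_xml  -- the feed document as str or bytes
    seen_ids -- a set of item ids already published (updated in place)
    publish  -- placeholder callable taking the message payload as bytes
    """
    published = 0
    for item in ET.fromstring(rss_xml).iter("item"):
        # Assumption: guid, or link when guid is absent, uniquely identifies an item.
        item_id = item.findtext("guid") or item.findtext("link")
        if item_id is None or item_id in seen_ids:
            continue
        publish(ET.tostring(item, encoding="utf-8"))  # payload is the raw <item> XML
        seen_ids.add(item_id)
        published += 1
    return published
```

In a real deployment the set of seen ids would have to live somewhere persistent (a datastore, for instance), since an in-memory set is lost whenever the app restarts.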