I've been reading about Storm and playing around with the examples from storm-starter.
I think I got the concept and it applies very well to many cases. I have a test project I want to do to learn more about this, but I'm wondering if Storm is really suited for this.
The conceptual problem I'm having is with the 'streaming' definition. It seems that Storm will work like a charm subscribing to a stream and processing it in real time, but I don't really have a stream; rather, I have a finite collection of data that I want to process.
I know there's Hadoop for this, but I'm interested in the real-time capabilities of Storm, as well as the other interesting points that Nathan, who wrote Storm, mentions in his talks.
So I was wondering: do people write Spouts that poll non-streaming APIs and then diff the results to emulate a stream?
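To make the question concrete, here is a minimal sketch of that polling-and-diffing idea in Java, written against the pre-Apache backtype.storm packages that storm-starter uses. fetchRecords() is a hypothetical placeholder for whatever non-streaming API call returns the current snapshot of the data:

```java
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of a spout that polls a non-streaming source and emits only
// records it has not emitted before, emulating a stream.
public class PollingDiffSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Set<String> seen = new HashSet<String>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Poll the non-streaming API for the full current result set.
        List<String> snapshot = fetchRecords(); // hypothetical helper

        boolean emittedSomething = false;
        for (String record : snapshot) {
            // Diff against what we've already emitted: add() returns
            // true only for records we haven't seen yet.
            if (seen.add(record)) {
                collector.emit(new Values(record));
                emittedSomething = true;
            }
        }

        // Back off when nothing new showed up, so we don't hammer
        // the source in a tight loop.
        if (!emittedSomething) {
            Utils.sleep(1000);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("record"));
    }

    private List<String> fetchRecords() {
        // Placeholder: call the real API here.
        return Collections.emptyList();
    }
}
```

One caveat with this approach: the seen set grows without bound, so for anything non-trivial you'd want a bounded structure (or diff against a cursor/timestamp the API provides).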
The second important point is that Storm topologies seem never to finish processing until interrupted, which again doesn't apply to my case. I would like my topology to know that once my finite list of source data is exhausted, processing can terminate and a final result can be emitted.
So, does all of that make sense in Storm terms, or am I looking at the wrong thing? If so, what alternatives do you propose for this sort of real-time parallel computing need?
Thanks!
Found the answer in the Storm Google group. It seems that DRPC topologies emit a tuple with the parameters, which is received by the DRPC spout as a stream, and the topology then indicates back when the processing has finished (using a unique ID called the Request ID).

That same thread says that Hadoop is probably best suited for these cases, unless the data is small enough to be processed in its entirety each time.
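For anyone landing here later, this is roughly what that looks like in code, modeled on the BasicDRPCTopology example from storm-starter (again assuming the pre-Apache backtype.storm packages of that era). The DRPC spout hands the bolt [request-id, args] tuples, and the bolt must emit [request-id, result] so Storm can route the answer back to the original request:

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class DrpcExample {
    // A bolt in a DRPC topology receives [request-id, args] and must
    // emit [request-id, result] so Storm can match the answer back
    // to the request that produced it.
    public static class ExclaimBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            Object requestId = tuple.getValue(0);
            String args = tuple.getString(1);
            collector.emit(new Values(requestId, args + "!"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclaim");
        builder.addBolt(new ExclaimBolt(), 3);

        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("drpc-demo", new Config(), builder.createLocalTopology(drpc));

        // execute() blocks until the topology emits the result for this
        // request's ID -- the computation has a well-defined end.
        System.out.println(drpc.execute("exclaim", "hello"));

        cluster.shutdown();
        drpc.shutdown();
    }
}
```

The key point for the "finite computation" question is that drpc.execute() blocks until the topology has produced the result for that Request ID, so each invocation has a defined end even though the topology itself keeps running.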