Can somebody explain the architectural differences between Twitter Storm and Apache Hadoop? I am looking for internals beyond the real-time vs. batch-processing distinction. On the surface the two look quite similar: writing a topology for Storm roughly corresponds to writing a MapReduce job for Hadoop, Hadoop's JobTracker/TaskTracker corresponds to Storm's Nimbus/Supervisor, and Hadoop's partitioning corresponds to Storm's stream groupings (shuffle, fields, etc.). (Am I correct in saying that Storm uses message queues internally to transport data between spouts/bolts, whereas Hadoop writes intermediate files and therefore incurs disk I/O?)
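For concreteness, here is a minimal sketch of how that wiring looks on the Storm side, assuming the Storm 0.9-era `backtype.storm` API. The `RandomWordSpout` and `WordCountBolt` class names are hypothetical placeholders; the `fieldsGrouping` call is the part that plays a role comparable to Hadoop partitioning map output by key, and Nimbus/Supervisors (rather than a JobTracker/TaskTrackers) would schedule these tasks on a real cluster:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordCountTopology {

    // Hypothetical spout: emits one random word per nextTuple() call, forever.
    public static class RandomWordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"storm", "hadoop", "nimbus", "supervisor"};
        private final Random random = new Random();

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(words[random.nextInt(words.length)]));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Hypothetical bolt: keeps a running count per word in memory.
    public static class WordCountBolt extends BaseRichBolt {
        private OutputCollector collector;
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple tuple) {
            String word = tuple.getStringByField("word");
            Integer count = counts.get(word);
            counts.put(word, count == null ? 1 : count + 1);
            collector.ack(tuple);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: emits nothing downstream.
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new RandomWordSpout(), 2);
        // fieldsGrouping routes every tuple with the same "word" value to the
        // same bolt task -- roughly what Hadoop's partitioner does when it sends
        // all values for a key to one reducer.
        builder.setBolt("counter", new WordCountBolt(), 4)
               .fieldsGrouping("words", new Fields("word"));

        // Run in-process for testing; on a cluster you would use StormSubmitter,
        // and the topology keeps running until it is explicitly killed.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```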
EDIT:
I have gone through the question Apache Storm compared to Hadoop, but the accepted answer leaves me wanting to know more than just the use case, i.e. real time vs. batch processing.
The main difference is that Storm does real-time processing of streams of Tuples (incoming data), while Hadoop does batch processing with MapReduce jobs.
Both process data in a distributed way, but with Storm you get live analytics as data arrives, whereas with Hadoop you have to wait for the MapReduce job to finish before you can work with the results. The contrast is sketched below.
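To make the contrast concrete, here is a minimal word-count sketch of the batch side, using the standard `org.apache.hadoop.mapreduce` API (the input/output paths come from the command line and are placeholders). Unlike the Storm topology above, which runs until killed and updates its counts continuously, this job reads a fixed input, shuffles intermediate key/value pairs to reducers (spilling to disk along the way), and only produces results once the whole job completes:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {

    // Mapper: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: the shuffle has already grouped all counts for a word together.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word-count");
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Blocks until the whole batch finishes; results only exist afterwards.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```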
Nathan Marz (the creator of Storm) is writing a book about Big Data in which he discusses how to build big data systems with Hadoop, Storm, and other technologies.
The book discusses "The Lambda Architecture". Check out this slide deck by Nathan Marz himself: Runaway complexity in Big Data... and a plan to stop it