Apache Storm compared to Hadoop

How does Storm compare to Hadoop? Hadoop seems to be the de facto standard for open-source, large-scale batch processing. Does Storm have any advantages over Hadoop, or are they completely different?

18bytes asked Jun 28 '12 17:06



2 Answers

Why don't you share your own opinion?

  • http://www.infoq.com/news/2011/09/twitter-storm-real-time-hadoop/
  • http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html

Twitter Storm has been touted as a real-time Hadoop, but that is more of a marketing line for easy consumption.

They are superficially similar since both are distributed application platforms. Apart from the typical distributed architectural elements, such as a master/slave layout and ZooKeeper-based coordination, the comparison falls off a cliff, in my view.

Twitter Storm is more like a pipeline for processing data as it arrives. The pipe is what connects the various computing nodes that receive data, compute, and deliver output. (Their lingo is spouts and bolts.) Extend this analogy to a complex pipeline wiring that can be re-engineered when required, and you get Twitter Storm.
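To make the spout/bolt idea concrete, here is a minimal sketch in plain Python. It is not Storm's API: the names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical, and generators stand in for the distributed wiring. The only point is that each item flows through the whole pipe as soon as it is emitted.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Hypothetical spout: emits one sentence at a time, as it 'arrives'."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterator[str]) -> Iterator[str]:
    """Hypothetical bolt: splits each incoming sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream: Iterator[str]) -> Iterator[Tuple[str, int]]:
    """Hypothetical bolt: keeps a running count and emits it after every word."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences: Iterable[str]) -> List[Tuple[str, int]]:
    """Wire spout -> split -> count; results appear per item, not per batch."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


if __name__ == "__main__":
    for word, running_count in run_topology(["storm is fast", "storm is streaming"]):
        print(word, running_count)
```

Note that the running count for a word is available right after that word passes through, without waiting for the input to end. In a real Storm topology the spouts and bolts would run on different machines, which this single-process sketch does not attempt to show.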

In a nutshell, it processes data as it arrives, so latency is very low: there is no waiting for a batch job to finish.

Hadoop, however, is different in this respect, primarily because of HDFS. It is a solution geared to distributed storage and to tolerating outages at many scales (disks, machines, racks, etc.).

MapReduce (M/R) is built to leverage data locality on HDFS to distribute computational jobs. Together, they do not provide a facility for real-time data processing. But that is not always a requirement when you are searching through large data sets (the needle-in-a-haystack case).
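For contrast, the batch style can be sketched in a few lines of plain Python. This is not Hadoop's API: `map_phase`, `shuffle`, `reduce_phase`, and `run_batch_job` are hypothetical names, and an in-memory list stands in for data already stored on HDFS. What it illustrates is that no result exists until the whole data set has been mapped, grouped, and reduced.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every stored record."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key; this needs the complete map output."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values to get one final count per word."""
    return {key: sum(values) for key, values in groups.items()}


def run_batch_job(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce over the full data set in one pass."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(run_batch_job(["storm is fast", "storm is streaming"]))
```

The answer only comes back once the job finishes, which is fine for a needle-in-a-haystack search over stored data but not for reacting to items as they arrive.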

In short, Twitter Storm is a distributed real-time data processing solution. I don't think we should compare them. Twitter built it because it needed a way to process small tweets, but a humongous number of them, and in real time.

See HStreaming if you are compelled to compare it with something.

pyfunc answered Sep 29 '22 20:09


Basically, both are used for analyzing big data, but Storm is used for real-time processing while Hadoop is used for batch processing.

This is a very good introduction to Storm that I found: Click here

Dao Lam answered Sep 29 '22 20:09