
Why Hadoop or Spark? There is ElasticSearch

There is a similar question here: https://stackoverflow.com/questions/23922404/elasticsearch-hadoop-why-would-i

But the answer doesn't really satisfy me.

My questions are simple:

  1. Why should we use Hadoop or Spark, when ElasticSearch exists?
  2. What is it that Hadoop or Spark has, and ElasticSearch doesn't have?
  3. If algorithms are the answer: I don't believe I can design better algorithms than Kimchy, yet with Hadoop or Spark we have to write our own. So again, why Hadoop or Spark?
  4. The answer said, "Elasticsearch is a distributed search engine and it shouldn't be used as a data warehouse." Why shouldn't it be used as a data warehouse?

Thank you and best regards,

Rizki Sunaryo

Rizki Sunaryo asked Mar 23 '15 03:03



1 Answer

I am very far from being an expert in distributed computing, but am I missing something here or are you comparing two completely different things?

Hadoop is a distributed batch computing platform that lets you run data extraction and transformation pipelines. ES is a search and analytics engine (or data aggregation platform) that lets you, say, index the result of your Hadoop job for search purposes.

So a complete pipeline would be something like:

Data --> Hadoop/Spark (MapReduce or Other Paradigm) --> Curated Data --> ElasticSearch/Lucene/SOLR/etc.
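To make the two stages of that pipeline concrete, here is a minimal sketch in plain Python. It is only an illustration: `map_phase` and `reduce_phase` are hypothetical stand-ins for a Hadoop/Spark batch job, `build_index` and `search` are a toy inverted index standing in for what ElasticSearch/Lucene does, and the log lines are made-up sample data. None of it uses the real Hadoop, Spark, or Elasticsearch APIs.

```python
from collections import defaultdict

# Made-up raw input: "user page bytes" log lines, one of them malformed.
RAW_LOGS = [
    "alice /docs/spark 1200",
    "bob /docs/hadoop 800",
    "alice /docs/hadoop 300",
    "malformed line",
    "bob /docs/hive 500",
]


def map_phase(lines):
    """Batch stage, map step: parse lines into (user, (page, bytes)), dropping bad rows."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            user, page, size = parts
            yield user, (page, int(size))


def reduce_phase(pairs):
    """Batch stage, reduce step: group by user and emit one curated record per user."""
    grouped = defaultdict(list)
    for user, value in pairs:
        grouped[user].append(value)
    return [
        {
            "user": user,
            "pages": sorted({page for page, _ in values}),
            "total_bytes": sum(size for _, size in values),
        }
        for user, values in sorted(grouped.items())
    ]


def build_index(docs):
    """Search stage: toy inverted index mapping each path token to document positions."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for page in doc["pages"]:
            for token in page.strip("/").split("/"):
                index[token].add(doc_id)
    return index


def search(index, docs, term):
    """Look up a single term and return the matching curated documents."""
    return [docs[i] for i in sorted(index.get(term, ()))]


# Data --> batch transform --> curated data --> search index
curated = reduce_phase(map_phase(RAW_LOGS))
index = build_index(curated)
hits = search(index, curated, "spark")
```

The batch stage does the heavy lifting once over all the raw data (cleaning, grouping, aggregating), while the index only has to answer lookups over the already curated records. That division of labour is the point of the pipeline.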

You may be in a situation where you just want to extract and/or transform data and have no use for Elasticsearch. You may also be in a situation where your data source does not require, or does not play well with, the distributed batch processing paradigm, in which case Hadoop is of no use to you.

Where you may be confused is that ES offers elasticsearch-hadoop, which plugs directly into Hadoop to offer you an "all-in-one" solution, so to speak.

Hopefully someone far more knowledgeable than me can also chip in on this.

Matt Fortier answered Sep 24 '22 06:09