There is a similar question here: https://stackoverflow.com/questions/23922404/elasticsearch-hadoop-why-would-i
But the answer there doesn't really satisfy me. My questions are simple:

1. Why should we use Hadoop or Spark when Elasticsearch exists?
2. What do Hadoop or Spark have that Elasticsearch doesn't?
3. If algorithms are the answer, I don't believe I'm any better than Kimchy at creating algorithms. With Hadoop or Spark we need to write our own algorithms. So again, why still use Hadoop or Spark?
4. The answer said, "Elasticsearch is a distributed search engine and it shouldn't be used as a data warehouse." Why shouldn't it be used as a data warehouse?

Thank you and best regards,
Rizki Sunaryo


Why Hadoop or Spark? There is ElasticSearch

1 Answer

I am very far from being an expert in distributed computing, but am I missing something here or are you comparing two completely different things?

Hadoop is a distributed batch computing platform that lets you run data extraction and transformation pipelines. Elasticsearch (ES) is a search and analytics engine (or data aggregation platform) that lets you, say, index the result of your Hadoop job for search purposes.

So a complete pipeline would be something like:

Data --> Hadoop/Spark (MapReduce or other paradigm) --> Curated Data --> Elasticsearch/Lucene/Solr/etc.
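To make that division of labor concrete, here is a toy, single-process sketch in plain Python. It is not Hadoop, Spark, or Elasticsearch code: the sample log lines and the function names (`map_phase`, `reduce_phase`, `build_index`, `search`) are made up for illustration. The first two functions stand in for the batch step that curates raw data, and the last two stand in for the search engine that indexes the curated result.

```python
from collections import defaultdict

# Hypothetical raw input: one log line per event (made-up sample data).
RAW_LINES = [
    "2015-01-01 alice search hadoop tutorial",
    "2015-01-01 bob search spark streaming",
    "2015-01-02 alice search elasticsearch aggregations",
    "2015-01-02 carol search spark tutorial",
]


def map_phase(line):
    """Batch 'map' step: parse one raw line into (user, query terms)."""
    _date, user, _action, *terms = line.split()
    return user, terms


def reduce_phase(pairs):
    """Batch 'reduce' step: collect every term each user searched for."""
    curated = defaultdict(list)
    for user, terms in pairs:
        curated[user].extend(terms)
    return dict(curated)


def build_index(curated):
    """Search-engine step: inverted index mapping each term to its users."""
    index = defaultdict(set)
    for user, terms in curated.items():
        for term in terms:
            index[term].add(user)
    return index


def search(index, term):
    """Look up a term in the inverted index; returns a sorted list of users."""
    return sorted(index.get(term, set()))


# Data --> batch processing --> curated data --> search index
curated = reduce_phase(map_phase(line) for line in RAW_LINES)
index = build_index(curated)

print(search(index, "spark"))     # users who searched for "spark"
print(search(index, "tutorial"))  # users who searched for "tutorial"
```

The batch half does arbitrary parsing and grouping over the whole data set, while the index half only answers lookups over data that is already clean. In a real deployment the first half would be a Hadoop or Spark job and the second half would be Elasticsearch.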

You may be in a situation where you just want to extract and/or transform data and have no use for Elasticsearch. You may also be in a situation where your data source does not require, or does not play well with, the distributed batch processing paradigm, in which case Hadoop is of no use to you.

Where you may be confused is that ES offers elasticsearch-hadoop, a connector that plugs directly into Hadoop to offer you an "all-in-one" solution, so to speak.

Hopefully someone far more knowledgeable than me can also chip in on this.

108

answered Sep 24 '22 06:09

Matt Fortier

