Actually, there is a similar question here: https://stackoverflow.com/questions/23922404/elasticsearch-hadoop-why-would-i
But the answer doesn't really satisfy me.
My question is simple:
Why shouldn't it be used as a data warehouse?
Thank you and best regards,
Rizki Sunaryo
Connect the massive data storage and deep processing power of Hadoop with the real-time search and analytics of Elasticsearch. The Elasticsearch-Hadoop (ES-Hadoop) connector lets you get quick insight from your big data and makes working in the Hadoop ecosystem even better.
Need of Hadoop to Run Spark
Real-time and faster data processing in Hadoop is not possible without Spark. On the other hand, Spark doesn't have any file system for distributed storage. However, many big data projects deal with multi-petabytes of data that need to be stored in distributed storage.
Elasticsearch is a NoSQL data store. It can handle changing data structures at any time without preprocessing or relationship configuration. This is extremely important for analytics.
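To make the "changing data structures" point concrete, here is a toy sketch in plain Python. It is not Elasticsearch code: `infer_mapping` and the sample documents are made up for illustration, and Elasticsearch's real dynamic mapping is considerably more involved. It only shows the idea that field types can be derived from the documents as they arrive, with no schema declared up front.

```python
# Toy illustration only: a stand-in for schema-on-write with dynamic field
# detection. The function name and sample documents are hypothetical.

def infer_mapping(docs):
    """Record the type of each field the first time it is seen."""
    mapping = {}
    for doc in docs:
        for field, value in doc.items():
            # Keep the first type seen for a field; later documents may
            # introduce new fields without any upfront configuration.
            mapping.setdefault(field, type(value).__name__)
    return mapping


documents = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5, "country": "ID"},  # new field appears later
]

mapping = infer_mapping(documents)
print(mapping)
```

The second document adds a `country` field that the first one lacks, and the mapping simply grows to include it.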
Data fragments in Hadoop can be too large and can create bottlenecks, which makes it slower than Spark. Spark is much faster mainly because it processes data in memory (MLlib is its machine-learning library, which runs on the same engine). Hadoop MapReduce is slower because it uses disk for storage and depends on disk read and write operations between processing steps.
I am very far from being an expert in distributed computing, but am I missing something here or are you comparing two completely different things?
Hadoop is a distributed batch computing platform, allowing you to run data extraction and transformation pipelines. ES is a search and analytics engine (or data aggregation platform), allowing you to, say, index the result of your Hadoop job for search purposes.
So a complete pipeline would be something like:
Data --> Hadoop/Spark (MapReduce or other paradigm) --> Curated Data --> Elasticsearch/Lucene/Solr/etc.
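As a rough illustration of that division of labour, here is a minimal sketch in plain Python. Nothing in it is real Hadoop or Elasticsearch code: `batch_transform` stands in for the batch job, `TinySearchIndex` is a toy inverted index standing in for the search engine, and the log lines are invented sample data.

```python
from collections import Counter, defaultdict

# Hypothetical raw input: one log line per record.
RAW_LOGS = [
    "2024-01-01 ERROR disk full on node1",
    "2024-01-01 INFO job started on node2",
    "2024-01-02 ERROR disk full on node2",
]


def batch_transform(lines):
    """Batch stage (stand-in for a Hadoop/Spark job): parse raw lines into curated docs."""
    docs = []
    for i, line in enumerate(lines):
        date, level, message = line.split(" ", 2)
        docs.append({"id": i, "date": date, "level": level, "message": message})
    return docs


class TinySearchIndex:
    """Search stage (stand-in for Elasticsearch): a toy inverted index."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)  # token -> ids of docs containing it

    def index(self, doc):
        self.docs[doc["id"]] = doc
        for token in doc["message"].lower().split():
            self.postings[token].add(doc["id"])

    def search(self, query):
        """Return docs whose message contains every query token."""
        tokens = query.lower().split()
        if not tokens:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.docs[i] for i in sorted(ids)]

    def count_by(self, field):
        """Simple aggregation: number of docs per value of a field."""
        return Counter(doc[field] for doc in self.docs.values())


# Data --> batch stage --> curated data --> search index
curated = batch_transform(RAW_LOGS)
search_index = TinySearchIndex()
for doc in curated:
    search_index.index(doc)

hits = search_index.search("disk full")
levels = search_index.count_by("level")
print([d["id"] for d in hits], dict(levels))
```

The batch stage only reshapes data and knows nothing about queries, while the index only answers lookups and aggregations over whatever it was given. That is the sense in which the two tools complement each other rather than compete.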
You may be in situations where you just want to extract and/or transform data, and have no use for Elasticsearch. You may also be in situations where your data source does not require, or does not play well with, the distributed batch processing paradigm, in which case Hadoop is of no use to you.
Where you may be confused is that ES offers elasticsearch-hadoop, which plugs directly into Hadoop to offer you an "all-in-one" solution, so to speak.
Hopefully someone far more knowledgeable than me can also chip in on this.