
Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill)

I want to do some "near real-time" data analysis (OLAP-like) on data stored in HDFS.
My research showed that all three of the frameworks mentioned report significant performance gains compared to Apache Hive. Does anyone have practical experience with any of them? Not only concerning performance, but also with respect to stability?

asked Jun 25 '13 by user2306380

People also ask

Why is Spark faster than Hadoop MapReduce?

Performance: Spark is faster because it keeps intermediate data in random-access memory (RAM) instead of reading and writing it to disk. Hadoop stores data across multiple sources and processes it in batches via MapReduce.
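
To make the in-memory point concrete, here is a minimal Spark sketch in Scala. The HDFS path and filter logic are illustrative assumptions; the point is that once the dataset is cached, repeated actions reuse it from RAM instead of re-reading it from disk, which is where chained MapReduce jobs spend much of their time.

    import org.apache.spark.{SparkConf, SparkContext}

    object CacheExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cache-example"))

        // Illustrative HDFS path; replace with a real dataset.
        val logs = sc.textFile("hdfs:///data/events/*.log")

        // Keep the filtered records in memory after the first computation.
        val errors = logs.filter(_.contains("ERROR")).cache()

        // Both actions below reuse the cached RDD instead of re-reading HDFS,
        // whereas chained MapReduce jobs would write and re-read intermediate data on disk.
        println(errors.count())
        println(errors.filter(_.contains("timeout")).count())

        sc.stop()
      }
    }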

What is the difference between Cloudera Impala and Hive?

Apache Hive might not be ideal for interactive computing, whereas Impala is meant for it. Hive is batch-oriented, built on Hadoop MapReduce, whereas Impala behaves more like an MPP database. Hive supports complex types but Impala does not. Apache Hive is fault-tolerant, whereas Impala does not support fault tolerance.
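
As a rough illustration of the batch-vs-MPP split, the sketch below (Scala, using the HiveServer2 JDBC driver, whose protocol Impala also speaks) sends the same HiveQL statement to both engines. The hostnames, ports, and table are assumptions; the query text is identical, only the endpoint and execution model differ.

    import java.sql.DriverManager

    object SameQueryTwoEngines {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val query = "SELECT country, COUNT(*) FROM web_logs GROUP BY country"

        // Hive (HiveServer2): the statement is planned as MapReduce jobs.
        val hive = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default")
        val hiveRows = hive.createStatement().executeQuery(query)
        while (hiveRows.next()) println("hive:   " + hiveRows.getString(1) + " " + hiveRows.getLong(2))

        // Impala accepts the same client protocol (typically port 21050) but executes
        // the identical query with its own MPP engine instead of MapReduce.
        val impala = DriverManager.getConnection("jdbc:hive2://impala-host:21050/;auth=noSasl")
        val impalaRows = impala.createStatement().executeQuery(query)
        while (impalaRows.next()) println("impala: " + impalaRows.getString(1) + " " + impalaRows.getLong(2))

        hive.close(); impala.close()
      }
    }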

Which is faster, Spark or Hive, and why?

Speed: operations in Hive are slower than in Apache Spark, in both memory and disk processing, because Hive runs on top of Hadoop MapReduce. Read/write operations: the number of read/write operations in Hive is greater than in Apache Spark, because Spark performs its intermediate operations in memory.
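
A small word-count-style sketch in Scala (with made-up paths) shows what "intermediate operations in memory" means in practice: the flatMap/map/reduceByKey chain runs as one in-memory pipeline, whereas an equivalent Hive/MapReduce plan would materialize intermediate output on HDFS between jobs.

    import org.apache.spark.{SparkConf, SparkContext}

    object PipelineExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pipeline-example"))

        sc.textFile("hdfs:///data/articles")           // read once from HDFS
          .flatMap(_.split("\\s+"))                    // tokenize
          .map(word => (word.toLowerCase, 1))          // key by word
          .reduceByKey(_ + _)                          // aggregate (shuffle uses memory/local disk)
          .saveAsTextFile("hdfs:///out/word_counts")   // only the final result is written to HDFS

        sc.stop()
      }
    }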

Is Apache Spark a replacement for Hadoop?

Apache Spark doesn't replace Hadoop; rather, it runs atop an existing Hadoop cluster to access the Hadoop Distributed File System (HDFS). Apache Spark can also process structured data in Hive and streaming data from Flume, Twitter, HDFS, and other sources.
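
For the "runs atop an existing Hadoop cluster" point, here is a sketch using Spark's later SparkSession API (which did not exist back in the Shark era; the table and path names are assumptions). It reads both a raw HDFS file and a Hive-managed table without moving any data out of the cluster.

    import org.apache.spark.sql.SparkSession

    object OnTopOfHadoop {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("on-top-of-hadoop")
          .enableHiveSupport()          // reuse the existing Hive metastore
          .getOrCreate()

        // Unstructured data straight from HDFS.
        val raw = spark.read.textFile("hdfs:///data/raw/2013-06-25")
        println(raw.count())

        // Structured data registered in Hive, queried in place.
        spark.sql("SELECT page, COUNT(*) AS hits FROM web.page_views GROUP BY page")
             .show(20)

        spark.stop()
      }
    }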


2 Answers

Comparing Hive with Impala, Spark, or Drill sometimes sounds inappropriate to me. The goals behind developing Hive and these tools were different. Hive was never developed for real-time, in-memory processing; it is based on MapReduce and was built for offline batch processing. It is best suited for long-running jobs that perform data-heavy operations, such as joins on very large datasets.

These tools, on the other hand, were developed with real-time querying in mind. Go for them when you need to query, in real time, data that is not huge and can fit into memory. I'm not saying you can't run queries on your big data using these tools, but you would be pushing their limits if you run real-time queries on petabytes of data, IMHO.

Quite often you will have seen (or read) that a particular company has several petabytes of data and is successfully serving the real-time needs of its customers. But in reality these companies are not querying their entire data most of the time. So the important thing is proper planning: knowing when to use what. I hope you get the point I'm trying to make.

Coming back to your actual question: in my view it is hard to provide a reasonable comparison at this time, since most of these projects are far from complete. They are not production-ready yet, unless you are willing to do some (or maybe a lot of) work on your own. And each of these projects has goals that are very specific to it.

For example, Impala was developed to take advantage of your existing Hive infrastructure so that you don't have to start from scratch: it uses the same metadata that Hive uses, and its goal is to run real-time queries on top of your existing Hadoop warehouse. Drill, by contrast, was developed to be more than a Hadoop-only project, providing distributed query capabilities across multiple big-data platforms, including MongoDB, Cassandra, Riak, and Splunk. Shark is compatible with Apache Hive, which means you can query it using the same HiveQL statements you would use through Hive; the difference is that Shark can return results up to 30 times faster than the same queries run on Hive.
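
A hedged sketch of the "same metadata" point (Scala over the HiveServer2 JDBC driver; the hosts, ports, and table name are made up): a table defined through Hive lands in the shared metastore and becomes visible to Impala once Impala refreshes its cached catalog, so no separate warehouse has to be built for it.

    import java.sql.DriverManager

    object SharedMetastore {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")

        // Define the table through Hive; the definition goes into the shared metastore.
        val hive = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default")
        hive.createStatement().execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS clicks (ts BIGINT, url STRING) " +
          "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LOCATION '/data/clicks'")

        // Impala reads the same metastore; it only needs to refresh its metadata cache.
        val impala = DriverManager.getConnection("jdbc:hive2://impala-host:21050/;auth=noSasl")
        val stmt = impala.createStatement()
        stmt.execute("INVALIDATE METADATA clicks")
        val rs = stmt.executeQuery("SELECT COUNT(*) FROM clicks")
        while (rs.next()) println(rs.getLong(1))

        hive.close(); impala.close()
      }
    }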

Impala is doing well at present and some folks are already using it, but I'm not as confident about the other two. All of these tools are good, but a fair comparison can only be made after you try them on your data and for your processing needs. As per my experience, Impala would be the best bet at the moment. I'm not saying the other tools are not good, but they are not yet mature enough. If you wish to use Impala with an already running Hadoop cluster (Apache's Hadoop, for example), you might have to do some additional work, since almost everybody uses Impala as a CDH feature.

Note: all of this is based solely on my experience. If you find something wrong or inappropriate, please do let me know. Comments and suggestions are welcome. I hope this answers some of your queries.

answered by Tariq


Here is an answer to "How does Impala compare to Shark?" from Reynold Xin, who led the Shark development effort at UC Berkeley AMPLab.

answered by lf.xiao