How does Impala provide faster query response compared to Hive?

Tags:

hadoop

hive

impala

I have recently started looking into querying large sets of CSV data stored on HDFS using Hive and Impala. As expected, I get better response times with Impala than with Hive for the queries I have used so far.

I am wondering if there are some types of queries/use cases that still need Hive and where Impala is not a good fit.

How does Impala provide faster query response compared to Hive for the same data on HDFS?

746

asked May 26 '13 02:05

techuser soma

4 Answers

You should see Impala as "SQL on HDFS", while Hive is more "SQL on Hadoop".

In other words, Impala doesn't use Hadoop's MapReduce at all. It simply has daemons running on all your nodes, which cache some of the data that is in HDFS so that they can return data quickly without having to go through a whole Map/Reduce job.

The reason for this is that there is a certain overhead involved in running a Map/Reduce job, so by short-circuiting Map/Reduce altogether you can get a pretty big gain in runtime.

That being said, Impala does not replace Hive; it is good for very different use cases. Unlike Hive, Impala doesn't provide fault tolerance, so if there is a problem during your query, the query is gone. For ETL-type jobs, where the failure of one job would be costly, I would definitely recommend Hive, but Impala can be awesome for small ad-hoc queries, for example for data scientists or business analysts who just want to take a look at and analyze some data without building robust jobs. Also, from my personal experience, Impala is still not very mature, and I have seen some crashes when the amount of data is larger than the available memory.

167

answered Oct 02 '22 01:10

Charles Menguy

IMHO, SQL on HDFS and SQL on Hadoop are the same. After all, Hadoop is HDFS (and also MapReduce). So when we say SQL on HDFS, it is understood that it is SQL on Hadoop (with or without MapReduce).

Coming back to the actual question, Impala provides faster response because it uses MPP (massively parallel processing), unlike Hive, which uses MapReduce under the hood and so incurs some initial overhead (as Charles has specified). Massively parallel processing is a type of computing that uses many separate CPUs running in parallel to execute a single program, where each CPU has its own dedicated memory. Because Impala is MPP-based, it avoids the overheads of a MapReduce job (job setup and creation, slot assignment, split creation, map generation, etc.), which makes it blazingly fast.
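To make the MPP idea concrete, here is a minimal sketch in Python of the scatter-gather pattern: each worker aggregates only its own partition, and a coordinator merges the partial results. This is an illustration, not Impala's implementation. The `PARTITIONS` data and the function names are made up, and threads stand in for separate nodes (real MPP nodes do not share memory).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the rows held by one node.
PARTITIONS = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]


def partial_sum(rows):
    """Aggregate one partition locally, as a single worker would."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def merge(partials):
    """Combine the per-worker partial results on the coordinator."""
    merged = {}
    for part in partials:
        for key, value in part.items():
            merged[key] = merged.get(key, 0) + value
    return merged


def run_query(partitions):
    # Fan the partitions out to the workers, then merge what they return.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return merge(pool.map(partial_sum, partitions))


if __name__ == "__main__":
    print(run_query(PARTITIONS))
```

The sketch is roughly what a `SELECT key, SUM(value) ... GROUP BY key` would do: no job is submitted or scheduled per query, the work simply goes to workers that each process their own share.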

But that doesn't mean that Impala is the solution to all your problems. Being highly memory-intensive (MPP), it is not a good fit for tasks that require heavy data operations such as large joins, because you just can't fit everything into memory. This is where Hive is a better fit.

So, if you need real-time, ad-hoc queries over a subset of your data, go for Impala. And if you have batch-processing needs over your Big Data, go for Hive.

HTH

answered Oct 02 '22 01:10

Tariq

There are some key features in Impala that make it fast.

1. It does not use MapReduce jobs, which are very expensive to fork in separate JVMs. It runs separate Impala daemons, which split the query, run the pieces in parallel, and merge the result sets at the end.
2. It does most of its operations in memory.
3. It uses HDFS for its storage, which is fast for large files. It caches as much as possible, from queries to results to data.
4. It supports newer file formats such as Parquet, which is a columnar file format. If you use this format, queries that access only a few columns will be faster.
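The point about columnar formats can be illustrated with a small Python sketch. It does not read Parquet; it only contrasts a row layout with a column layout in memory. The table, its column names, and its values are hypothetical.

```python
# Hypothetical table with three columns; names and values are made up.
ROWS = [
    {"id": 1, "city": "Oslo", "amount": 10.0},
    {"id": 2, "city": "Lima", "amount": 20.5},
    {"id": 3, "city": "Oslo", "amount": 4.5},
]


def to_columns(rows):
    """Pivot row records into one list per column (a columnar layout)."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def sum_row_layout(rows, column):
    # Every full record is visited even though only one field is needed.
    return sum(row[column] for row in rows)


def sum_column_layout(columns, column):
    # Only the requested column is read; the others are never touched.
    return sum(columns[column])


COLUMNS = to_columns(ROWS)

if __name__ == "__main__":
    print(sum_row_layout(ROWS, "amount"), sum_column_layout(COLUMNS, "amount"))
```

Both functions return the same total, but with a columnar file on disk the second style lets the engine skip the bytes of `id` and `city` entirely, which is where the saving comes from for queries that touch only a few columns.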

answered Oct 02 '22 01:10

Animesh Raj Jha

Impala doesn't replace MapReduce or use MapReduce as a processing engine. Let's first understand the key differences between Impala and Hive.

1. Impala performs in-memory query processing, while Hive does not.
2. Hive uses MapReduce to process queries, while Impala uses its own processing engine.
3. Hive can be extended using User Defined Functions (UDFs) or by writing a custom Serializer/Deserializer (SerDe); however, Impala does not support extensibility as Hive does for now.
4. Impala depends on Hive to function (it uses the Hive metastore for table definitions), while Hive does not depend on any other application and just needs the core Hadoop platform (HDFS and MapReduce).
5. Impala's query language is a subset of HiveQL, which means that almost every Impala query (with a few limitations) can run in Hive. The reverse is not true, because some HiveQL features supported in Hive are not supported in Impala.

Now, why is Impala faster than Hive at query processing? Below are some key points.

1. While processing SQL-like queries, Impala does not write intermediate results to disk (as Hive on MapReduce does); instead, the full SQL processing is done in memory, which makes it faster.
2. With Impala, the query starts executing almost immediately, whereas a MapReduce job may take significant time to start up, and this adds to the processing time of larger SQL queries.
3. The Impala query planner uses smart algorithms to execute queries in multiple stages on parallel nodes to provide results faster, avoiding sort and shuffle steps, which are unnecessary in most cases.
4. Impala has information about each data block in HDFS, so when processing a query it takes advantage of this knowledge to distribute the work more evenly across all DataNodes.
5. An Impala daemon (impalad) runs on each DataNode. These daemons are responsible for processing queries. When a query is submitted, the impalad that receives it reads and writes the data files and parallelizes the query by distributing the work to all other Impala nodes in the Impala cluster.
6. Another key reason for the fast performance is that Impala first generates native machine code for each query at runtime (using LLVM). The generated code executes faster than code running inside a general-purpose framework: Impala queries run natively in memory, whereas a framework adds delay to the execution through its own overhead.
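The point about block information can be sketched as a simple greedy assignment: give each block to the least-loaded node that already stores a replica, so scans stay local and the load stays even. This is only an illustrative heuristic in Python, not Impala's actual scheduler. The block ids, node names, and the `assign_blocks` function are hypothetical.

```python
def assign_blocks(block_replicas):
    """Assign each block to the least-loaded node that holds a replica.

    block_replicas maps a block id to the list of nodes storing it.
    Returns a dict mapping node -> list of block ids to scan locally.
    """
    plan = {}
    for block, nodes in sorted(block_replicas.items()):
        # Make sure every candidate node appears in the plan.
        for node in nodes:
            plan.setdefault(node, [])
        # Pick the node with the fewest blocks so far; break ties by name.
        target = min(nodes, key=lambda n: (len(plan[n]), n))
        plan[target].append(block)
    return plan


# Hypothetical block map: block id -> nodes holding a replica.
BLOCKS = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node1", "node3"],
    "blk_3": ["node2", "node3"],
    "blk_4": ["node1", "node2"],
}

if __name__ == "__main__":
    for node, blocks in sorted(assign_blocks(BLOCKS).items()):
        print(node, blocks)
```

Every block ends up on a node that already has the data, so no block has to be shipped over the network before it is scanned.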

Impala processes all queries in memory, so the memory available on each node is definitely a limiting factor. You must have enough memory to hold the resultant dataset, which can grow multifold during complex JOIN operations.

If a query starts processing the data and the resultant dataset cannot fit in the available memory, the query will fail.
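The memory concern can be illustrated with a toy in-memory hash join in Python. It is a sketch under stated assumptions: the `max_rows` budget is a hypothetical stand-in for a node's memory limit (counted in result rows rather than bytes), and `MemoryLimitExceeded` is a made-up exception, not an Impala error.

```python
class MemoryLimitExceeded(Exception):
    """Raised when the join result outgrows the hypothetical row budget."""


def hash_join(left, right, max_rows):
    """Inner-join two lists of (key, value) pairs entirely in memory.

    max_rows is a stand-in for a node's memory limit, in result rows.
    """
    # Build phase: index the right side by join key.
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)

    # Probe phase: every match adds a row, so duplicate keys multiply output.
    result = []
    for key, value in left:
        for match in index.get(key, []):
            if len(result) >= max_rows:
                raise MemoryLimitExceeded(
                    f"join result exceeds {max_rows} rows"
                )
            result.append((key, value, match))
    return result


# Hypothetical inputs: 3 rows on each side, all sharing one key.
LEFT = [("k", 1), ("k", 2), ("k", 3)]
RIGHT = [("k", "x"), ("k", "y"), ("k", "z")]

if __name__ == "__main__":
    print(len(hash_join(LEFT, RIGHT, max_rows=100)))
```

Two 3-row inputs that share a single key produce 3 x 3 result rows, which is the multifold growth described above. With a smaller budget, for example `max_rows=5`, the same join raises `MemoryLimitExceeded` partway through instead of completing.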

answered Oct 02 '22 01:10

KZapagol
