
Is it possible to execute Hive queries in parallel by writing a separate MapReduce program?

Tags: hive, mapreduce

I have asked several questions about improving the performance of Hive queries. Some of the answers concerned the number of mappers and reducers. I tried using multiple mappers and reducers but saw no difference in execution time. I don't know why; maybe I did not do it the right way, or I missed something else.

I would like to know whether it is possible to execute Hive queries in parallel. What I mean is that the queries normally get executed one after another, as in a queue. For instance:

query1

query2

query3

. . . n

It takes too much time to execute and I want to reduce the execution time.

I need to know whether, if we use a MapReduce program inside a Hive JDBC program, the queries can be executed in parallel. I don't know whether that will work, but that is what I am aiming for.

I am restating my questions below:

1) If it is possible to run multiple Hive queries in parallel, does it require multiple Hive Thrift Servers?

2) Is it possible to start multiple Hive Thrift Servers?

3) I assume it is not possible to start multiple Hive Thrift Servers on the same port?

4) Can we start multiple Hive Thrift Servers on different ports?

Please suggest a solution for this. If you have any other alternative, I will try that as well.

Bhavesh Shah asked May 11 '12 11:05



1 Answer

As you might already know, Hive is a SQL-like front end to Hadoop and MapReduce. Any non-trivial Hive query gets compiled to MapReduce and run on Hadoop. MapReduce is a parallel processing framework, so each of your Hive queries already processes its data in parallel.

By default, the jobs Hive submits are scheduled on Hadoop by a FIFO scheduler, so only one Hive query executes at a given time, and the next query starts when the first one is done. In most circumstances, I would suggest optimizing individual Hive queries instead of parallelizing multiple Hive queries. If you are inclined towards parallelizing Hive queries, that might indicate that your cluster is being used inefficiently.

To further analyze the performance and usage of your Hive queries, you can install a distributed monitoring system like Ganglia to monitor the usage of your cluster (Amazon EMR supports it too).
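If you still want to try submitting several queries at once from the client side, the general shape could look like the sketch below. It is only an illustration, not a Hive API: `run_query` is a hypothetical function you would supply (for example, one that opens its own connection to Hive and executes a single statement), and `fake_run_query` is a stand-in so the sketch runs without a cluster. Whether the resulting jobs actually run at the same time still depends on how jobs are scheduled on your Hadoop cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_queries_concurrently(
    queries: List[str],
    run_query: Callable[[str], object],
    max_workers: int = 4,
) -> List[object]:
    """Submit each query from its own worker thread.

    `run_query` is a hypothetical caller-supplied function that executes
    one statement and returns its result. Results come back in the same
    order as `queries`; an exception raised by any query is re-raised here.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_query, queries))


def fake_run_query(sql: str) -> str:
    # Stand-in for a real Hive client call; it only echoes the statement.
    return "ran: " + sql


results = run_queries_concurrently(["query1", "query2", "query3"], fake_run_query)
```

Each worker thread should use its own connection, since sharing one connection across threads is not safe with many database clients.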

Long story short, you don't have to write a MapReduce program; that's what you are using Hive for in the first place. However, if you know something about the data that Hive does not, your Hive queries might perform sub-optimally. For example, your data might be sorted by some column without Hive knowing about it. In such cases, if you can't set that additional metadata in Hive, it might make sense to write a MapReduce job that takes the additional information into account and potentially gives you better performance. In most cases, I have found Hive performance to be on par with that of the MapReduce jobs corresponding to the Hive query.
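To make the sorted-data point concrete, here is a toy, non-Hadoop sketch of why knowing the sort order can help. The function name and the sample rows are made up for illustration. If the input is already sorted by the grouping column, a per-key sum can be computed in one streaming pass, without sorting or building a hash table of all keys first.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, Tuple


def sum_by_key_sorted(rows: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Sum values per key in a single streaming pass.

    Correct only if `rows` is already sorted (or at least grouped) by key;
    that is the extra knowledge about the data this sketch assumes.
    """
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


# Hypothetical sample input, already sorted by the first column.
rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]
totals = list(sum_by_key_sorted(rows))
```

On unsorted input the same function would emit a key more than once, which is why a general-purpose engine cannot take this shortcut unless it is told about the ordering.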

Mark Grover answered Nov 01 '22 21:11
