
How to make Hive run MapReduce jobs concurrently?

I'm new to Hive and have run into a problem.

I have a table in Hive like this:

CREATE TABLE td(id INT, time STRING, ip STRING, v1 BIGINT, v2 INT, v3 INT,
v4 INT, v5 BIGINT, v6 INT) PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n';

And I run a query like this:

from td
INSERT OVERWRITE  DIRECTORY '/tmp/total.out' select count(v1)
INSERT OVERWRITE  DIRECTORY '/tmp/totaldistinct.out' select count(distinct v1)
INSERT OVERWRITE  DIRECTORY '/tmp/distinctuin.out' select distinct v1

INSERT OVERWRITE  DIRECTORY '/tmp/v4.out' select v4 , count(v1), count(distinct v1) group by v4
INSERT OVERWRITE  DIRECTORY '/tmp/v3v4.out' select v3, v4 , count(v1), count(distinct v1) group by v3, v4

INSERT OVERWRITE  DIRECTORY '/tmp/v426.out' select count(v1), count(distinct v1)  where v4=2 or v4=6
INSERT OVERWRITE  DIRECTORY '/tmp/v3v426.out' select v3, count(v1), count(distinct v1) where v4=2 or v4=6 group by v3

INSERT OVERWRITE  DIRECTORY '/tmp/v415.out' select count(v1), count(distinct v1)  where v4=1 or v4=5
INSERT OVERWRITE  DIRECTORY '/tmp/v3v415.out' select v3, count(v1), count(distinct v1) where v4=1 or v4=5 group by v3

It works, and the output is what I want.

But there is one problem: Hive generates 9 MapReduce jobs and runs them one by one.

I ran EXPLAIN on this query and got the following output:

STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-0 depends on stages: Stage-9
  Stage-10 depends on stages: Stage-9
  Stage-1 depends on stages: Stage-10
  Stage-11 depends on stages: Stage-9
  Stage-2 depends on stages: Stage-11
  Stage-12 depends on stages: Stage-9
  Stage-3 depends on stages: Stage-12
  Stage-13 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-13
  Stage-14 depends on stages: Stage-9
  Stage-5 depends on stages: Stage-14
  Stage-15 depends on stages: Stage-9
  Stage-6 depends on stages: Stage-15
  Stage-16 depends on stages: Stage-9
  Stage-7 depends on stages: Stage-16
  Stage-17 depends on stages: Stage-9
  Stage-8 depends on stages: Stage-17

It seems that stages 9-17 correspond to MapReduce jobs 0-8.
But according to the EXPLAIN output above, stages 10-17 depend only on stage 9,
so I have a question: why can't jobs 1-8 run concurrently?

Or how can I make jobs 1-8 run concurrently?

Thank you very much for your help!

asked Jan 15 '12 07:01 by SSolid


1 Answer

In hive-default.xml there is a property named "hive.exec.parallel" that enables parallel execution of jobs. Its default value is "false"; set it to "true" to turn this on. A second property, "hive.exec.parallel.thread.number", controls the maximum number of jobs that can be executed in parallel.
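As a minimal sketch, both properties can also be set for a single session from the Hive CLI before running the query, rather than editing the XML configuration. The thread count of 8 below is only an illustrative value, not a recommendation from the original answer.

```sql
-- Let independent stages of one query be submitted at the same time
-- instead of strictly one after another.
SET hive.exec.parallel=true;

-- Upper bound on how many jobs may run concurrently.
-- 8 is an illustrative value chosen for this sketch; tune it to your cluster.
SET hive.exec.parallel.thread.number=8;

-- Run the multi-insert query from the question in this same session.
```

Only stages with no dependency on each other can overlap, so in the plan from the question stage 9 still has to finish before stages 10-17 start. Whether those jobs then actually run side by side also depends on how many free task slots the cluster has.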

For more details: https://issues.apache.org/jira/browse/HIVE-549

answered Oct 31 '22 02:10 by Kai Zhang