How to make hive run mapreduce jobs concurrently?

I'm new to Hive and have run into a problem.

I have a table in Hive like this:

CREATE TABLE td(id INT, time STRING, ip STRING, v1 BIGINT, v2 INT, v3 INT,
v4 INT, v5 BIGINT, v6 INT) PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

And I run a query like this:

from td
INSERT OVERWRITE  DIRECTORY '/tmp/total.out' select count(v1)
INSERT OVERWRITE  DIRECTORY '/tmp/totaldistinct.out' select count(distinct v1)
INSERT OVERWRITE  DIRECTORY '/tmp/distinctuin.out' select distinct v1

INSERT OVERWRITE  DIRECTORY '/tmp/v4.out' select v4 , count(v1), count(distinct v1) group by v4
INSERT OVERWRITE  DIRECTORY '/tmp/v3v4.out' select v3, v4 , count(v1), count(distinct v1) group by v3, v4

INSERT OVERWRITE  DIRECTORY '/tmp/v426.out' select count(v1), count(distinct v1)  where v4=2 or v4=6
INSERT OVERWRITE  DIRECTORY '/tmp/v3v426.out' select v3, count(v1), count(distinct v1) where v4=2 or v4=6 group by v3

INSERT OVERWRITE  DIRECTORY '/tmp/v415.out' select count(v1), count(distinct v1)  where v4=1 or v4=5
INSERT OVERWRITE  DIRECTORY '/tmp/v3v415.out' select v3, count(v1), count(distinct v1) where v4=1 or v4=5 group by v3

It works, and the output is what I want.

But there is one problem: Hive generates 9 MapReduce jobs and runs them one by one.

I ran EXPLAIN on this query and got the following output:

STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-0 depends on stages: Stage-9
  Stage-10 depends on stages: Stage-9
  Stage-1 depends on stages: Stage-10
  Stage-11 depends on stages: Stage-9
  Stage-2 depends on stages: Stage-11
  Stage-12 depends on stages: Stage-9
  Stage-3 depends on stages: Stage-12
  Stage-13 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-13
  Stage-14 depends on stages: Stage-9
  Stage-5 depends on stages: Stage-14
  Stage-15 depends on stages: Stage-9
  Stage-6 depends on stages: Stage-15
  Stage-16 depends on stages: Stage-9
  Stage-7 depends on stages: Stage-16
  Stage-17 depends on stages: Stage-9
  Stage-8 depends on stages: Stage-17

It seems that stages 9-17 correspond to MapReduce jobs 0-8.
But according to the EXPLAIN output above, stages 10-17 depend only on stage 9,
so my question is: why can't jobs 1-8 run concurrently?

Or how can I make jobs 1-8 run concurrently?

Thank you very much for your help!

asked Jan 15 '12 07:01

SSolid

1 Answer

In hive-default.xml there is a property named "hive.exec.parallel" that enables jobs to be executed in parallel. Its default value is "false"; change it to "true" to turn this on. A second property, "hive.exec.parallel.thread.number", controls the maximum number of jobs that can be executed in parallel.
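
As a minimal sketch, both properties can also be set for the current session from the Hive CLI before running the query, instead of editing the XML configuration. The value 8 below is only an illustrative choice for the thread count, not a recommendation:

```sql
-- Allow independent stages of a query to run at the same time
SET hive.exec.parallel=true;
-- Upper bound on how many jobs may run in parallel (8 is an illustrative value)
SET hive.exec.parallel.thread.number=8;
```

With these set, stages that do not depend on each other (Stage-10 through Stage-17 in the EXPLAIN output of the question) become eligible to run concurrently, cluster capacity permitting.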

For more details: https://issues.apache.org/jira/browse/HIVE-549

answered Oct 31 '22 02:10

Kai Zhang

Tags: hadoop, hive, mapreduce
