How to decrease the number of mappers in Hive when the files are bigger than the block size?

Question

I have a Hive table with more than 720 partitions. Each partition holds more than 400 files, and the average file size is 1 GB.

Now I execute the following SQL:

insert overwrite table test_abc select * from DEFAULT.abc A WHERE A.P_HOUR = '2017042400';

This partition (P_HOUR = '2017042400') has 409 files. When I submit this SQL, I get the following output:

INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:409

INFO : Submitting tokens for job: job_1482996444961_9384015

I searched many documents for ways to decrease the number of mappers, but most of them address the case where the files are small. I tried the following settings in beeline, but they did not work.

First attempt:

set mapred.min.split.size =5000000000;
set mapred.max.split.size =10000000000;
set mapred.min.split.size.per.node=5000000000;
set mapred.min.split.size.per.rack=5000000000;

Second attempt:

set mapreduce.input.fileinputformat.split.minsize =5000000000;
set mapreduce.input.fileinputformat.split.maxsize=10000000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=5000000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=5000000000;

My Hadoop version is:

Hadoop 2.7.2
Compiled by root on 11 Jul 2016 10:58:45

My Hive version is:

Connected to: Apache Hive (version 1.3.0)
Driver: Hive JDBC (version 1.3.0)

David דודו Markovitz · Accepted Answer

In addition to the settings in your post, set:

set hive.hadoop.supports.splittable.combineinputformat=true;

hive.hadoop.supports.splittable.combineinputformat
- Default Value: false
- Added In: Hive 0.6.0
Whether to combine small input files so that fewer mappers are spawned.
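A minimal sketch of a complete beeline session for the mr engine, using the table and partition from the question. Setting hive.input.format explicitly is an assumption added here: CombineHiveInputFormat is the usual default, but files can only be grouped into larger splits while it is in effect. The byte values are copied from the question for illustration and are not tuned recommendations.

```sql
-- Assumption: the job runs on the mr engine and the session may override these properties.
set hive.execution.engine=mr;
-- Grouping several files into one split requires the combine input format.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.hadoop.supports.splittable.combineinputformat=true;
-- Bounds for a combined split, in bytes (illustrative values from the question).
set mapreduce.input.fileinputformat.split.maxsize=10000000000;
set mapreduce.input.fileinputformat.split.minsize=5000000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=5000000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=5000000000;

insert overwrite table test_abc
select * from DEFAULT.abc A where A.P_HOUR = '2017042400';
```

If the settings take effect, the "number of splits" line in the job output should report fewer than 409 splits. The comment lines can be dropped when pasting into a client that does not accept them.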

Sandeep Singh · Answer

MRv2 uses CombineInputFormat, while Tez uses grouped splits to determine the number of mappers. If your execution engine is mr and you would like to reduce the number of mappers, use:

mapreduce.input.fileinputformat.split.maxsize=xxxxx

From the CombineFileInputFormat documentation: If maxSplitSize is specified, then blocks on the same node are combined to a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If the maxSplitSize is equal to the block size, then this class is similar to the default splitting behavior in Hadoop.

This link can be helpful for controlling the number of mappers in Hive if your execution engine is mr.
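Before tuning split sizes, it may help to confirm which engine and input format the session is actually using. In the Hive CLI and beeline, set followed by a property name and no value prints the current value. A small diagnostic sketch:

```sql
-- Print the current values; nothing is changed by these statements.
set hive.execution.engine;
set hive.input.format;
set mapreduce.input.fileinputformat.split.maxsize;
```

If hive.input.format reports org.apache.hadoop.hive.ql.io.HiveInputFormat rather than CombineHiveInputFormat, splits are computed per file and files are not grouped together. That is one possible reason, not a confirmed diagnosis, for the split count staying equal to the file count.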

If your execution engine is tez and you would like to control the number of mappers, then use:

set tez.grouping.max-size = XXXXXX;

Here is a good reference on parallelism in Hive with the Tez execution engine.
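A minimal sketch of the Tez variant, assuming Tez is installed on the cluster (the question does not say whether it is). tez.grouping.min-size is added here alongside the max-size setting from the answer; both are in bytes, and the values are illustrative ones taken from the question rather than recommendations.

```sql
-- Assumption: Tez is available to this Hive installation.
set hive.execution.engine=tez;
-- Tez groups input splits so that each group falls between these sizes (bytes).
set tez.grouping.min-size=5000000000;
set tez.grouping.max-size=10000000000;

insert overwrite table test_abc
select * from DEFAULT.abc A where A.P_HOUR = '2017042400';
```

With roughly 1 GB files, a larger grouping size should let one mapper task read several files, so the task count for the map vertex is expected to fall well below 409.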

Tags:

hive

mapper

lance
