I have a Hive table with more than 720 partitions. Each partition holds more than 400 files, and the average file size is 1 GB.
Now I execute the following SQL:
insert overwrite table test_abc select * from DEFAULT.abc A WHERE A.P_HOUR = '2017042400';
This partition (P_HOUR = '2017042400') has 409 files. When I submit this SQL, I get the following output:
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:409
INFO : Submitting tokens for job: job_1482996444961_9384015
I searched through many documents to find out how to decrease the number of mappers, but most of them address the case where the files are small. I tried the following settings in beeline, but they did not work.
First attempt:
set mapred.min.split.size =5000000000;
set mapred.max.split.size =10000000000;
set mapred.min.split.size.per.node=5000000000;
set mapred.min.split.size.per.rack=5000000000;
Second attempt:
set mapreduce.input.fileinputformat.split.minsize =5000000000;
set mapreduce.input.fileinputformat.split.maxsize=10000000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=5000000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=5000000000;
My Hadoop version is:
Hadoop 2.7.2 Compiled by root on 11 Jul 2016 10:58:45
My Hive version is:
Connected to: Apache Hive (version 1.3.0)
Driver: Hive JDBC (version 1.3.0)
In addition to the settings in your post, also set:
set hive.hadoop.supports.splittable.combineinputformat=true;
hive.hadoop.supports.splittable.combineinputformat
- Default Value: false
- Added In: Hive 0.6.0
Whether to combine small input files so that fewer mappers are spawned.
MRv2 uses CombineInputFormat, while Tez uses grouped splits to determine the number of mappers. If your execution engine is mr and you would like to reduce the number of mappers, use:
mapreduce.input.fileinputformat.split.maxsize=xxxxx
If maxSplitSize is specified, then blocks on the same node are combined to a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If the maxSplitSize is equal to the block size, then this class (CombineFileInputFormat) is similar to the default splitting behavior in Hadoop.
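For the mr engine, the settings discussed above could be put together in one beeline session roughly like this. It is only a sketch: the byte values are the ones from the question, not tuned recommendations, and setting hive.input.format explicitly is my assumption (CombineHiveInputFormat is normally the default already). Whether the files actually get combined can also depend on the file format and compression, so check the "number of splits" line in the job output.

```sql
-- Sketch only: sizes are in bytes and taken from the question, not tuned values.
set hive.execution.engine=mr;

-- Assumption: make sure the combining input format is in use (usually the default).
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.hadoop.supports.splittable.combineinputformat=true;

-- Upper bound for one combined split (about 10 GB here).
set mapreduce.input.fileinputformat.split.maxsize=10000000000;
-- Minimum amount to gather per node / per rack before starting another split.
set mapreduce.input.fileinputformat.split.minsize.per.node=5000000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=5000000000;

insert overwrite table test_abc
select * from DEFAULT.abc A where A.P_HOUR = '2017042400';
```

If combining takes effect, 409 files of about 1 GB each with a 10 GB maximum split size would be expected to give somewhere around 40 splits instead of 409, but that figure is an estimate, not something I have measured on your cluster.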
This link can be helpful for controlling the number of mappers in Hive if your execution engine is mr.
If your execution engine is tez and you would like to control the number of mappers, then use:
set tez.grouping.max-size = XXXXXX;
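As a rough illustration for the tez engine, the grouping bounds might be set like this before running the query. The sizes are placeholders I picked for files of around 1 GB (they are in bytes), and tez.grouping.min-size is an extra setting not mentioned above that is commonly set together with the max size.

```sql
-- Sketch only: placeholder sizes in bytes, adjust to your data.
set hive.execution.engine=tez;

-- Lower and upper bounds for the size of one grouped split.
set tez.grouping.min-size=5368709120;   -- 5 GB
set tez.grouping.max-size=10737418240;  -- 10 GB

insert overwrite table test_abc
select * from DEFAULT.abc A where A.P_HOUR = '2017042400';
```

Larger grouping sizes should lead to fewer, longer-running mappers, so it is probably worth trying a couple of values rather than going straight to the largest one.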
Here is a good reference on parallelism in Hive with the tez execution engine.