Hive compaction using INSERT OVERWRITE PARTITION

Tags: hql, hadoop, hive, hdfs

I am trying to address the small-files problem by compacting the files under Hive partitions with the INSERT OVERWRITE ... PARTITION command in Hadoop.

Query:

SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;


INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11) 
SELECT col1,col2,col3  from tbl1  
WHERE year=2016 and  month=03 and  day=11;

Input Files:

For testing purposes, I have three files of 40 MB each under the Hive partition (2016/03/11) in HDFS.

2016/03/11/file1.csv

2016/03/11/file2.csv

2016/03/11/file3.csv

For example, my block size is 128 MB, so I would like to create only one output file. But I am getting 3 different compressed files.

Please help me find the Hive configuration that restricts the output file size. If I do not use compression, I get a single file.

Hive version: 1.1

asked Mar 24 '16 09:03

William R

1 Answer

It's interesting that you still get 3 files when you specify the partition and use compression. You may want to look into dynamic partitioning, or ditch the partitioning and focus on the number of mappers and reducers being created by your job. If your files are small, I can see why you would want them all in one file on your target, but then I would also question the need for compression on them.

The number of files created in your target is directly tied to the number of reducers or mappers. If the SQL you write needs a reduce stage, the number of files created will be the same as the number of reducers used in the job. This can be controlled by setting the number of reducers used in the job.

set mapred.reduce.tasks = 1;
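
For a plain INSERT ... SELECT with no join or aggregation, the reducer setting above only matters if the query is given a reduce stage. A minimal sketch, assuming the table and partition from the question; the DISTRIBUTE BY clause is an addition that is not in the original query, and col1 is just an arbitrary column to shuffle on:

```sql
-- Assumption: a shuffle is forced so that the single reducer writes the output.
SET mapred.reduce.tasks=1;

INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3
FROM tbl1
WHERE year=2016 AND month=03 AND day=11
DISTRIBUTE BY col1;  -- adds a reduce stage; with one reducer, all rows go to it
```

The trade-off is that every row passes through one reducer, which is fine for a partition of about 120 MB but may be slow for large ones.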

In your example SQL there most likely wouldn't be any reducers used, so the number of files in the target equals the number of mappers used, which in turn equals the number of files in the source. It isn't as easy to control the number of output files on a map-only job, but there are a number of configuration settings that can be tried.

This setting combines small input files so that fewer mappers are spawned; the default is false.

set hive.hadoop.supports.splittable.combineinputformat = true;
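
That flag is usually tried together with the combining input format and the split-size limits. A sketch, assuming a 256 MB target like the one in the question; the values are illustrative, and whether the files actually end up in one split depends on their format and where their blocks live:

```sql
-- Assumption: 256 MB is the desired split size; tune for your cluster.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=256000000;           -- upper bound for a combined split
SET mapred.min.split.size.per.node=256000000;  -- minimum split size when combining on one node
SET mapred.min.split.size.per.rack=256000000;  -- minimum split size when combining within a rack
```

With three 40 MB inputs, a single combined split would mean a single mapper and therefore a single output file for a map-only job.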

Try setting a threshold in bytes for the input files; any table under this threshold becomes a candidate for conversion to a map join, which can affect the number of output files.

set hive.mapjoin.smalltable.filesize = 25000000;
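
Hive's own merge step is another thing to check. The query in the question sets hive.merge.mapredfiles, which applies to jobs with a reduce stage; for a map-only job the corresponding flag is hive.merge.mapfiles. A sketch, assuming the 256 MB thresholds from the question; whether the merge actually runs for compressed output on Hive 1.1 is something to verify on your own cluster:

```sql
SET hive.merge.mapfiles=true;                 -- merge small files after a map-only job
SET hive.merge.mapredfiles=true;              -- merge small files after a map-reduce job
SET hive.merge.size.per.task=256000000;       -- target size of the merged files
SET hive.merge.smallfiles.avgsize=256000000;  -- merge when the average output file is smaller than this
```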

As for the compression, I would try changing the type of compression being used, just to see if that makes any difference in your output.

set hive.exec.orc.default.compress = SNAPPY;  -- or ZLIB, NONE; applies to ORC tables only

answered Nov 14 '22 18:11

Jared
