Not able to apply dynamic partitioning for a huge data set in Hive

Tags:

hadoop

hive

I have a table test_details with about 4 million records. Using the data in this table, I have to create a new partitioned table test_details_par with records partitioned on visit_date. Creating the table is not a challenge, but when I come to the part where I have to INSERT the data using dynamic partitions, Hive gives up when I try to insert data for a larger number of days. If I do it for 2 or 3 days, the MapReduce job runs successfully, but for more days it fails with a Java heap space error or a GC error.

A simplified snapshot of my DDL is as follows:

CREATE TABLE test_details_par (visit_id INT, store_id SMALLINT) PARTITIONED BY (visit_date DATE);

INSERT INTO TABLE test_details_par PARTITION(visit_date) SELECT visit_id, store_id, visit_date FROM test_details DISTRIBUTE BY visit_date;

I have tried setting these parameters so that Hive handles the job better:

set hive.exec.dynamic.partition.mode=nonstrict; 
set hive.exec.dynamic.partition=true; 
set hive.exec.max.dynamic.partitions.pernode=10000;

Is there anything I am missing that would let me run the INSERT for the complete batch without specifying the dates explicitly?

Asked Feb 19 '14 by Neels


People also ask

How do I enable dynamic partitioning in Hive?

Enable dynamic partitioning with the following commands: set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;

What is the maximum number of partitions in Hive?

A single query cannot commit more than 500,000 (500K) partitions. To stay within this limit, restrict the time range covered by the query.

Which type of partitioning is allowed by default in Hive?

Dynamic partitioning: partitions are automatically created based on the value of the last column in the SELECT.

When should I use dynamic partition in Hive?

Dynamic partitioning takes more time to load data than static partitioning. It is suitable when a table holds a large amount of data, or when you want to partition on columns whose values you do not know in advance.


1 Answer

Neels,

Hive 12 and below have well-known scalability issues with dynamic partitioning that will be addressed in Hive 13. The problem is that Hive attempts to hold a file handle open for each and every partition it writes out, which causes it to run out of memory and crash. Hive 13 will sort by partition key so that it only needs to hold one file open at a time.
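For context, the Hive 13 change is a sort-based optimization for dynamic-partition inserts, and on that release it is expected to surface as the hive.optimize.sort.dynamic.partition setting. A minimal sketch of what the load could look like once you are on Hive 13, reusing the table and column names from the question:

-- Sketch for Hive 13+: let Hive sort rows by the partition key so each writer
-- keeps only one partition file open at a time.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.optimize.sort.dynamic.partition=true;

INSERT INTO TABLE test_details_par PARTITION (visit_date)
SELECT visit_id, store_id, visit_date
FROM test_details;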

You have 3 options as I see it:

  1. Change your job to insert only a few partitions at a time (see the sketch after this list).
  2. Wait for Hive 13 to be released and try that (2-3 months to wait).
  3. If you know how, build Hive from trunk and use it to complete your data load.
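For option 1, a minimal sketch of batching the load by date range, reusing the question's table and column names (the date literals are illustrative placeholders to adjust for each batch):

-- Load a bounded range of visit_date values per run instead of the full table at once.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE test_details_par PARTITION (visit_date)
SELECT visit_id, store_id, visit_date
FROM test_details
WHERE visit_date >= '2014-01-01' AND visit_date < '2014-02-01'  -- placeholder bounds
DISTRIBUTE BY visit_date;

Run the same statement once per date range until the whole table is loaded; each run then only holds file handles open for the partitions in that range.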
Answered by Carter Shanklin