Not able to apply dynamic partitioning for a huge data set in Hive

Tags:

hadoop

hive

I have a table test_details with about 4 million records. Using the data in this table, I have to create a new partitioned table test_details_par with records partitioned on visit_date. Creating the table is not a challenge, but when I come to the part where I have to INSERT the data using dynamic partitions, Hive gives up when I try to insert data for a larger number of days. If I do it for 2 or 3 days, the MapReduce job runs successfully, but for more days it fails with a Java heap space error or a GC error.

A simplified snapshot of my DDL is as follows:

CREATE TABLE test_details_par (visit_id INT, store_id SMALLINT) PARTITIONED BY (visit_date DATE);

INSERT INTO TABLE test_details_par PARTITION(visit_date) SELECT visit_id, store_id, visit_date FROM test_details DISTRIBUTE BY visit_date;

I have tried setting these parameters so that Hive handles the job better:

set hive.exec.dynamic.partition.mode=nonstrict; 
set hive.exec.dynamic.partition=true; 
set hive.exec.max.dynamic.partitions.pernode=10000;

Is there anything I am missing that would let me run the INSERT for the complete batch without specifying the dates explicitly?

Asked Feb 19 '14 by Neels


People also ask

How do I enable dynamic partitioning in Hive?

Enable dynamic partitioning with the following commands: set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;

What is the maximum number of partitions in Hive?

A single query cannot commit more than 500,000 (500K) partitions. To stay within this limit, restrict the time range covered by the query.

Which type of partitioning is allowed by default in Hive?

Dynamic partitioning: partitions are automatically created based on the value of the last column in the SELECT.

When should I use dynamic partition in Hive?

Dynamic partitioning takes more time to load data than static partitioning. It is suitable when a table holds a large amount of data, or when you want to partition on columns whose values you do not know in advance.


1 Answer

Neels,

Hive 12 and below have well-known scalability issues with dynamic partitioning that will be addressed in Hive 13. The problem is that Hive attempts to hold a file handle open for each and every partition it writes out, which causes it to run out of memory and crash. Hive 13 will sort by partition key so that it only needs to hold one file open at a time.
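For context, the Hive 13 change is a sort-based optimization for dynamic-partition inserts, and on that release it is expected to surface as the hive.optimize.sort.dynamic.partition setting. A minimal sketch of what the load could look like once you are on Hive 13, reusing the table and column names from the question:

-- Sketch for Hive 13+: let Hive sort rows by the partition key so each writer
-- keeps only one partition file open at a time.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.optimize.sort.dynamic.partition=true;

INSERT INTO TABLE test_details_par PARTITION (visit_date)
SELECT visit_id, store_id, visit_date
FROM test_details;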

You have 3 options as I see it:

  1. Change your job to insert only a few partitions at a time (see the sketch after this list).
  2. Wait for Hive 13 to be released and try that (2-3 months to wait).
  3. If you know how, build Hive from trunk and use it to complete your data load.
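For option 1, a minimal sketch of batching the load by date range, reusing the question's table and column names (the date literals are illustrative placeholders to adjust for each batch):

-- Load a bounded range of visit_date values per run instead of the full table at once.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE test_details_par PARTITION (visit_date)
SELECT visit_id, store_id, visit_date
FROM test_details
WHERE visit_date >= '2014-01-01' AND visit_date < '2014-02-01'  -- placeholder bounds
DISTRIBUTE BY visit_date;

Run the same statement once per date range until the whole table is loaded; each run then only holds file handles open for the partitions in that range.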
Answered by Carter Shanklin