Main difference between dynamic and static partitioning in Hive

Tags: hive

What is the main difference between static and dynamic partitioning in Hive? As I understand it, loading each partition with its own insert is static, while a single insert that populates the whole partitioned table is dynamic. Is there any other difference or advantage?

Ronak asked Jun 18 '15 17:06



People also ask

What is static partitioning in Hive?

In static partitioning, the partition value is supplied explicitly when the data is loaded. The partition columns are not read from the data files being loaded; their values are given in the PARTITION clause of the load or insert statement.

What is dynamic partitioning in Hive?

In dynamic partitioning, a single insert loads data from another table (typically a non-partitioned one) into the partitioned table, and Hive decides which partition each row belongs to from the values of the partition columns in the query result.

What is static and dynamic partition in spark?

From version 2.3.0, Spark provides two modes for overwriting partitions when saving data: DYNAMIC and STATIC. Static mode overwrites all the partitions, or only the partition specified in the INSERT statement (for example, PARTITION=20220101); dynamic mode overwrites only those partitions that have data written into them at runtime.
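
As a rough illustration of the dynamic mode in Spark SQL, the sketch below switches the overwrite mode and then overwrites only the partitions that appear in the query result. The setting name `spark.sql.sources.partitionOverwriteMode` is the Spark configuration key; the tables `sales` and `staging_sales` and their columns are hypothetical, and the sketch assumes `sales` is a data source table partitioned by `dt`.

```sql
-- Spark SQL, not Hive. Table and column names are hypothetical.
-- The default mode is static; switch to dynamic for this session.
SET spark.sql.sources.partitionOverwriteMode=dynamic;

-- Only the dt partitions present in the SELECT result are replaced;
-- other existing partitions of sales are left untouched.
INSERT OVERWRITE TABLE sales PARTITION (dt)
SELECT id, amount, dt
FROM staging_sales;
```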

What is the difference between Hive partition and spark partition?

Both are chunks of data, but Spark splits data into partitions in order to process it in parallel in memory, whereas a Hive partition is a unit of persistent storage on disk.


2 Answers

In static partitioning, we need to specify the partition column value in each and every LOAD statement.

Suppose table t1(userid, name, occupation, country) is partitioned on the column country. Each load then has to provide the country value explicitly:

hive> LOAD DATA INPATH '/hdfs path of the file' INTO TABLE t1 PARTITION(country="US");
hive> LOAD DATA INPATH '/hdfs path of the file' INTO TABLE t1 PARTITION(country="UK");
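
For context, here is a minimal sketch of how the partitioned table behind those LOAD statements might be declared. The column types, the comma delimiter, and the file paths are assumptions made for illustration; only the table and column names come from the answer.

```sql
-- The partition column (country) is declared in PARTITIONED BY,
-- not in the regular column list.
CREATE TABLE IF NOT EXISTS t1 (
  userid     INT,
  name       STRING,
  occupation STRING
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Static partitioning: one LOAD per partition, value given by hand.
-- The paths are hypothetical; each file holds rows for one country only.
LOAD DATA INPATH '/data/users_us.csv' INTO TABLE t1 PARTITION (country = 'US');
LOAD DATA INPATH '/data/users_uk.csv' INTO TABLE t1 PARTITION (country = 'UK');

-- List the partitions that now exist.
SHOW PARTITIONS t1;
```

LOAD only moves the file into the partition directory and does not inspect the rows, so it is up to you to make sure each file really contains data for the partition you name.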

Dynamic partitioning lets us avoid specifying the partition column value each time. The approach is as follows:

  1. Create a non-partitioned table t2 and insert the data into it.
  2. Create a table t1 partitioned on the intended column (say country).
  3. Load the data into t1 from t2 as below:

    hive> INSERT INTO TABLE t1 PARTITION(country) SELECT * FROM t2;
    
  4. Make sure the partition column is the last column in the SELECT list (here, country is the last column of the non-partitioned table t2), because Hive maps the trailing columns of the query result to the dynamic partition columns by position, not by name.
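
The steps above (loading the partitioned table t1 from the non-partitioned table t2) can be sketched end to end as follows. The two SET properties are Hive's standard switches for dynamic partitioning; the column types, the delimiter, and the input path are assumptions for illustration.

```sql
-- Enable dynamic partitioning; nonstrict mode allows every
-- partition column to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Step 1: non-partitioned staging table; country is an ordinary column.
CREATE TABLE IF NOT EXISTS t2 (
  userid     INT,
  name       STRING,
  occupation STRING,
  country    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Hypothetical input file containing rows for all countries.
LOAD DATA INPATH '/data/users_all.csv' INTO TABLE t2;

-- Step 2: target table partitioned on country.
CREATE TABLE IF NOT EXISTS t1 (
  userid     INT,
  name       STRING,
  occupation STRING
)
PARTITIONED BY (country STRING);

-- Step 3: one insert; Hive creates a partition per distinct country.
-- Step 4: the partition column is listed last in the SELECT.
INSERT INTO TABLE t1 PARTITION (country)
SELECT userid, name, occupation, country
FROM t2;

SHOW PARTITIONS t1;
```

Listing the columns explicitly instead of using `SELECT *` makes the positional mapping visible and keeps the insert correct even if the column order of t2 changes later.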

Azam Khan answered Sep 19 '22 19:09



Partitioning in Hive is very useful for pruning data at query time: a query that filters on a partition column only has to read the matching partitions, which reduces query times.
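
A small sketch of what pruning looks like in practice, assuming a hypothetical table t1 partitioned by a country column (the table and column names are illustrative only):

```sql
-- The filter is on the partition column, so Hive only needs to read
-- the files under the country=US partition directory.
SELECT userid, name
FROM t1
WHERE country = 'US';

-- EXPLAIN DEPENDENCY reports the tables and partitions a query reads,
-- which is one way to check that pruning took effect.
EXPLAIN DEPENDENCY
SELECT userid, name
FROM t1
WHERE country = 'US';
```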

Partitions are created when data is inserted into the table, and which kind you use depends on how you load the data. When loading files (especially big files) into Hive tables, static partitions are usually preferred, because loading is faster than with dynamic partitioning. You "statically" add a partition to the table and move the file into that partition. Since the files are big, they are usually generated in HDFS already. You can get the partition column value from the filename, the date, etc., without reading the whole big file.
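
As a rough sketch of that static flow, assuming a hypothetical events table partitioned by day and a file whose name already carries the date (table name, columns, delimiter, and path are all illustrative):

```sql
-- Hypothetical table partitioned by day.
CREATE TABLE IF NOT EXISTS events (
  userid INT,
  action STRING
)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- "Statically" add the partition; the value is taken from the
-- file name, not from the file contents.
ALTER TABLE events ADD IF NOT EXISTS PARTITION (day = '2015-06-18');

-- Without LOCAL, LOAD DATA INPATH moves the HDFS file into the
-- partition directory; no row of the file is read.
LOAD DATA INPATH '/incoming/events_2015-06-18.csv'
INTO TABLE events PARTITION (day = '2015-06-18');
```

The explicit ALTER TABLE step is optional here, since LOAD creates the partition if it does not exist; it is shown only to mirror the "add a partition, then move the file" description.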

In the case of dynamic partitioning, the whole big file, i.e. every row of the data, is read, and the data is partitioned by a MapReduce (MR) job into the destination table depending on certain fields in the file. So dynamic partitioning is usually useful when you are running some sort of ETL flow in your data pipeline. For example, you load a huge file into Table X through a move command, then run an insert query into Table Y that partitions the data on fields of Table X, say day and country. You may then want to run a further ETL step that takes the data in one country partition of Table Y and loads it into Table Z, where the data is partitioned by city for that particular country only, and so on.
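
The X, Y, Z flow described above might look roughly like this. The table names follow the answer; the columns, types, and the choice of 'US' as the country are assumptions for illustration.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Table X: raw, non-partitioned; the huge file is moved in with LOAD.
CREATE TABLE IF NOT EXISTS table_x (
  userid INT, city STRING, amount DOUBLE, day STRING, country STRING
);

-- Table Y: partitioned by day and country.
CREATE TABLE IF NOT EXISTS table_y (
  userid INT, city STRING, amount DOUBLE
)
PARTITIONED BY (day STRING, country STRING);

-- Table Z: partitioned by country and city.
CREATE TABLE IF NOT EXISTS table_z (
  userid INT, amount DOUBLE, day STRING
)
PARTITIONED BY (country STRING, city STRING);

-- ETL step 1: every row of X is read and routed to a (day, country)
-- partition of Y. Partition columns come last, in PARTITION-clause order.
INSERT INTO TABLE table_y PARTITION (day, country)
SELECT userid, city, amount, day, country
FROM table_x;

-- ETL step 2: take one country partition of Y and split it by city.
-- country is static, city is dynamic; static keys must come first.
INSERT INTO TABLE table_z PARTITION (country = 'US', city)
SELECT userid, amount, day, city
FROM table_y
WHERE country = 'US';
```

The second insert also benefits from partition pruning on Table Y, since its WHERE clause filters on a partition column.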

Thus, depending on your end table, your requirements for the data, and the form in which the data is produced at the source, you may choose static or dynamic partitioning.

Urvishsinh Mahida answered Sep 22 '22 19:09
