What is the main difference between static and dynamic partition in Hive? Using individual insert means static and single insert to partition table means dynamic. Is there any other advantage?
In static partitioning, we partition the table based on some attribute. The attributes or columns we use to separate records are not present in the actual data we load to our table but we separate them using the partition statement available in Hive.
Dynamic partitioning is the strategic approach to load the data from the non-partitioned table where the single insert to the partition table is called a dynamic partition.
From version 2.3. 0, Spark provides two modes to overwrite partitions to save data: DYNAMIC and STATIC. Static mode will overwrite all the partitions or the partition specified in INSERT statement, for example, PARTITION=20220101; dynamic mode only overwrites those partitions that have data written into it at runtime.
They are both chunks of data, but Spark splits data in order to process it in parallel in memory. Hive partition is in the storage, in the disk, in persistence.
in static partitioning we need to specify the partition column value in each and every LOAD statement.
suppose we are having partition on column country for table t1(userid, name,occupation, country), so each time we need to provide country value
hive>LOAD DATA INPATH '/hdfs path of the file' INTO TABLE t1 PARTITION(country="US")
hive>LOAD DATA INPATH '/hdfs path of the file' INTO TABLE t1 PARTITION(country="UK")
dynamic partition allow us not to specify partition column value each time. the approach we follows is as below:
load data in t1 from t2 as below:
hive> INSERT INTO TABLE t2 PARTITION(country) SELECT * from T1;
make sure that partitioned column is always the last one in non partitioned table(as we are having country column in t2)
Partitioning in Hive is very useful to prune data during query to reduce query times.
Partitions are created when data is inserted into table. Depending on how you load data you would need partitions. Usually when loading files (big files) into Hive tables static partitions are preferred. That saves your time in loading data compared to dynamic partition. You "statically" add a partition in table and move the file into the partition of the table. Since the files are big they are usually generated in HDFS. You can get the partition column value form the filename, day of date etc without reading the whole big file.
Incase of dynamic partition whole big file i.e. every row of the data is read and data is partitioned through a MR job into the destination tables depending on certain field in file. So usually dynamic partition are useful when you are doing sort of a ETL flow in your data pipeline. e.g. you load a huge file through a move command into a Table X. then you run a inert query into a Table Y and partition data based on field in table X say day , country. You may want to further run a ETL step to partition the data in country partition in Table Y into a Table Z where data is partitioned based on cities for a particular country only. etc.
Thus depending on your end table or requirements for data and in what form data is produced at source you may choose static or dynamic partition.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With