
Create hive external table from partitioned parquet files in Azure HDInsights

I have data saved as Parquet files in Azure Blob Storage. The data is partitioned by year, month, day and hour, like:

cont/data/year=2017/month=02/day=01/

I want to create an external table in Hive using the following CREATE statement, which I wrote using this reference.

CREATE EXTERNAL TABLE table_name (uid string, title string, value string) 
PARTITIONED BY (year int, month int, day int) STORED AS PARQUET 
LOCATION 'wasb://cont@storage_name.blob.core.windows.net/data';

This creates the table, but querying it returns no rows. I tried the same CREATE statement without the PARTITIONED BY clause and that seems to work, so the issue appears to be with partitioning.

What am I missing?

Asked by chhantyal, Apr 11 '17 12:04




1 Answer

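Hive only reads partitions that are registered in the metastore, so directories that already exist under the table location are ignored until they are added. After you create the partitioned table, run the following to add the directories as partitions: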

MSCK REPAIR TABLE table_name;

If you have a large number of partitions, you might need to set hive.msck.repair.batch.size:

When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)


Written by the OP:

This will probably fix your issue; however, if the data is very large, it won't work. See the relevant issue here.

As a workaround, you can add partitions to the Hive metastore one by one, like:

alter table table_name add partition(year=2016, month=10, day=11, hour=11)

We wrote a simple script to automate this ALTER statement, and it seems to work for now.
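The OP's script is not shown, so the following is only a minimal sketch of what such automation could look like, not the actual script. It assumes daily partitions matching the CREATE statement in the question (year, month, day) and zero-padded directory names as in the example path (month=02/day=01). The function name partition_statements and the date range are hypothetical. An explicit LOCATION is given for each partition because the directory names are zero-padded while the partition columns are declared as int.

```python
from datetime import date, timedelta


def partition_statements(table, base_location, start, end):
    """Yield one ALTER TABLE ... ADD PARTITION statement per day in [start, end).

    Assumes a year=YYYY/month=MM/day=DD directory layout under base_location,
    as in the question. Extend the spec and path for an hour level if needed.
    """
    current = start
    while current < end:
        # Partition values are plain integers (the columns are declared as int).
        spec = f"year={current.year}, month={current.month}, day={current.day}"
        # Directory names keep the zero padding seen in the question's path.
        path = (
            f"{base_location}/year={current:%Y}"
            f"/month={current:%m}/day={current:%d}"
        )
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
            f"LOCATION '{path}';"
        )
        current += timedelta(days=1)


if __name__ == "__main__":
    # Hypothetical date range; table name and location are from the question.
    location = "wasb://cont@storage_name.blob.core.windows.net/data"
    for statement in partition_statements(
        "table_name", location, date(2017, 2, 1), date(2017, 2, 4)
    ):
        print(statement)
```

The printed statements can be saved to a file and run with hive -f. IF NOT EXISTS makes the script safe to re-run when some partitions are already registered.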

Answered by David דודו Markovitz, Oct 31 '22 09:10