Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Partition columns when inserting into a Hive table from a select

Tags:

hadoop

hive

I was researching partitions in Hive and came upon:

http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ In this link, the author says: “When inserting data into a partition, it’s necessary to include the partition columns as the last columns in the query. Thecolumn names in the source query don’t need to match the partition column names, but they really do need to be last – there’s no way to wire up Hive differently”

I have a query like:

insert overwrite table MyDestTable PARTITION (partition_date)
select
grid.partition_date,
….

I have the above query that has been running for a while without errors. As you can see, I am selecting the partition column as the very first column. Is it wrong? I have tried to corroborate the author’s statement from other sources but am not finding other documents that say the same. Does anybody here know what the right thing to do is? From my end, being a Hive newbie, I am just going by whether Hive is complaining or not (which it is not).

KS

like image 555
KS1234 Avatar asked Feb 08 '14 20:02

KS1234


People also ask

How will you insert into a partitioned table in Hive from another table?

If your partition column is not at the end then you need to do following. Create another table without partition. Load data into the table (Assume state is at first column). Insert into the partitioned table by selecting columns from the non-partitioned table (make sure you select state at the end).

How do I choose a partition column in Hive?

In hive, you create a table based on the usage pattern and so you should choose both partitioning the bucketing based on what your Analysis Queries would look like. Partitioning helps you speed up the queries with predicates (i.e. Where conditions).

How partitions on the Hive table can be shown?

You can see Hive MetaStore tables,Partitions information in table of "PARTITIONS". You could use "TBLS" join "Partition" to query special table partitions.


3 Answers

example:

set hive.exec.dynamic.partition=true;  
set hive.exec.dynamic.partition.mode=nonstrict;  

drop table tmp.table1;

create table tmp.table1(  
col_a string,col_b int)  
partitioned by (ptdate string,ptchannel string)  
row format delimited  
fields terminated by '\t' ;  

insert overwrite table tmp.table1 partition(ptdate,ptchannel)  
select col_a,count(1) col_b,ptdate,ptchannel
from tmp.table2
group by ptdate,ptchannel,col_a ;
like image 123
anyoneking Avatar answered Oct 03 '22 02:10

anyoneking


Yes, It is mandatory to use partitioned column as last column while inserting the data. Make sure PARTITIONED BY column shouldn't be an existing column in the table.Hive will take care the rest.

CREATE EXTERNAL TABLE temp (
DATA_OWNER STRING,
DISTRICT_CODE STRING,
BILLING_ACCOUNT_NO STRING,
INST_COUNTY STRING,
INST_POST_CODE STRING,
INST_STATUS STRING,
INST_EXCHANGE_GROUP_CODE STRING,
EXCHANGE_CODE STRING
) PARTITIONED BY (TS_LAST_UPDATED STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE LOCATION 'user/entity/site/inbound/CSS_INSTALLATION_PARTITIONED';

INSERT OVERWRITE TABLE temp PARTITION (TS_LAST_UPDATED)
SELECT
DATA_OWNER,
DISTRICT_CODE,
BILLING_ACCOUNT_NO,
INST_COUNTY,
INST_POST_CODE,
INST_STATUS,
INST_EXCHANGE_GROUP_CODE,
EXCHANGE_CODE,TO_DATE(TS_LAST_UPDATED) FROM temp1 
like image 26
Satheesh Avatar answered Oct 03 '22 04:10

Satheesh


The dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.

see hive wiki for more information.

like image 44
pensz Avatar answered Oct 03 '22 02:10

pensz