Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Schema Evolution in Parquet Hive table

Tags:

hive

parquet

I have a lot of data in a Parquet based Hive table (Hive version 0.10). I have to add a few new columns to the table. I want the new columns to have data going forward. If the value is NULL for already loaded data, that is fine with me.

If I add the new columns and not update the old Parquet files, it gives an error and it looks strange as I am adding String columns only.

Error getting row data with exception java.lang.UnsupportedOperationException: Cannot inspect java.util.ArrayList

Can you please tell me how to add new fields to Parquet Hive without affecting the already existing data in the table ?

I use Hive version 0.10.

Thanks.

like image 767
user1717720 Avatar asked Jan 09 '15 20:01

user1717720


Video Answer


2 Answers

1) Hive starting with version 0.13 has parquet schema evoultion built in.

https://issues.apache.org/jira/browse/HIVE-6456

https://github.com/Parquet/parquet-mr/pull/297

ps. Notice that out-of-the-box support for schema evolution might take a toll on performance. For example, Spark has a knob to turn parquet schema evolution on and off. After one of the recent Spark releases, it's now off by default because of performance hit (epscially when there are a lot of parquet files). Not sure if Hive 0.13+ has such a setting too.

2) Also wanted to suggest to try creating views in Hive on top of such parquet tables where you expect often schema changes, and use views everywhere but not tables directly.

For example, if you have two tables - A and B with compatible schemas, but table B has two more columns, you could workaround this by

CREATE VIEW view_1 AS
SELECT col1,col2,col3,null as col4,null as col5 FROM tableA
UNION ALL
SELECT col1,col2,col3,col4,col5 FROM tableB
;

So you don't actually have to recreate any tables like @miljanm has suggested, you can just recreate the view. It'll help with the agility of your projects.

like image 153
Tagar Avatar answered Sep 28 '22 10:09

Tagar


Create a new table with the two new columns. Insert data by issuing:

insert into new_table select old_table.col1, old_table.col2,...,null,null from old_table;

The last two nulls are for the two new columns. That's it.

If you have too many columns, it may be easier for you to write a program that reads the old files and writes the new ones.

Hive 0.10 does not have support for schema evolution in parquet as far as I know. Hive 0.13 does have it, so you may try to upgrade hive.

like image 45
miljanm Avatar answered Sep 28 '22 10:09

miljanm