
Impala - convert existing table to parquet format

I have a partitioned table that I create and populate from Avro files or text files.

Once the table is populated, is there a way to convert it to Parquet? I know I could have declared it as Parquet when creating the table in the first place, for example: CREATE TABLE default.test (name_id STRING) PARTITIONED BY (year INT, month INT, day INT) STORED AS PARQUET.

In my use case I have to start with text files. This is because I want to avoid creating many small files inside the partition folders every time I insert or update; my table takes a very high number of inserts and updates, and this is causing a drop in performance. Is there a way to convert the table to Parquet after it has been created and the data inserted?

user1189851 asked Oct 14 '14 16:10

People also ask

How do you create a Parquet table in Impala?

Creating Parquet tables in Impala:

[impala-host:21000] > CREATE TABLE parquet_table_name (x INT, y STRING) STORED AS PARQUET;

Or, to clone the column names and data types of an existing table:

[impala-host:21000] > CREATE TABLE parquet_table_name LIKE other_table_name STORED AS PARQUET;

How do I manually create a Parquet file?

On Windows it requires winutils; download it and set the environment variable. Clone parquet-mr, build everything, and run the 'convert-csv' command of parquet-cli. The 'cat' command then shows the content of the resulting Parquet file.

Is Parquet more efficient than CSV?

Apache Parquet is column-oriented and designed to provide efficient columnar storage compared to row-based file formats such as CSV. Parquet files were also designed with complex nested data structures in mind.
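To make the row- versus column-oriented distinction concrete, here is a toy, stdlib-only Python sketch (not real Parquet, purely an illustration): with a row-based layout like CSV, reading a single field still forces you to parse every row, whereas a columnar layout stores each column contiguously so you can read just the column you need.

```python
import csv
import io

# Toy dataset: three records with three fields each.
rows = [
    {"name": "a", "year": "2014", "value": "1.5"},
    {"name": "b", "year": "2014", "value": "2.5"},
    {"name": "c", "year": "2015", "value": "3.5"},
]

# Row-oriented (CSV-style): to read just "value" we still parse whole rows.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "year", "value"])
writer.writeheader()
writer.writerows(rows)
buf.seek(0)
values_from_rows = [float(r["value"]) for r in csv.DictReader(buf)]

# Column-oriented (Parquet-style): each column is stored contiguously,
# so reading one column touches only that column's data.
columns = {
    "name": [r["name"] for r in rows],
    "year": [int(r["year"]) for r in rows],
    "value": [float(r["value"]) for r in rows],
}
values_from_columns = columns["value"]

print(values_from_rows == values_from_columns)  # True: both yield [1.5, 2.5, 3.5]
```

Real Parquet adds per-column encoding and compression on top of this layout, which is where most of the space and scan-time savings over CSV come from.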


1 Answer

You can create a table on your data in HDFS stored as text, Avro, or any other format.

Then you can create another table using:

CREATE TABLE x_parquet LIKE x_non_parquet STORED AS PARQUET;

You can then set the compression codec to something like Snappy or gzip:

SET PARQUET_COMPRESSION_CODEC=snappy;

Then pull the data from the non-Parquet table and insert it into the new Parquet-backed table:

INSERT INTO x_parquet select * from x_non_parquet;

If you want to save space and avoid confusion, I'd automate this as part of any data ingestion and then drop the original non-Parquet table. This will make your queries run faster and your data take up less space.
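The steps above are straightforward to script. Below is a minimal, hypothetical Python sketch that only builds the statements from this answer (the `_parquet` naming suffix and the helper function are my own choices, not part of Impala); each resulting string could then be run with something like `impala-shell -q "<stmt>"`.

```python
def parquet_conversion_statements(src_table, codec="snappy"):
    """Build the Impala statements that convert src_table into a
    Parquet-backed copy named <src_table>_parquet (naming is our choice)."""
    dst_table = f"{src_table}_parquet"
    return [
        # Clone the schema of the source table, but store as Parquet.
        f"CREATE TABLE {dst_table} LIKE {src_table} STORED AS PARQUET;",
        # Pick the Parquet compression codec for subsequent inserts.
        f"SET PARQUET_COMPRESSION_CODEC={codec};",
        # Copy the data across into the Parquet-backed table.
        f"INSERT INTO {dst_table} SELECT * FROM {src_table};",
        # Optional: remove the original non-Parquet table afterwards.
        f"DROP TABLE {src_table};",
    ]

for stmt in parquet_conversion_statements("x_non_parquet"):
    print(stmt)
```

Note this sketch just generates the SQL text; running it against a real cluster still requires an Impala connection.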

Ray answered Jan 01 '23 09:01