I have a partitioned table that I create and populate from Avro files or text files.
Once the table is populated, is there a way to convert it to Parquet?
I know we could have done this while creating the table itself, for example:

CREATE TABLE default.test (name_id STRING)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET;
In my use case I'll have to use text files initially, because I want to avoid creating multiple small files inside the partition folders every time I insert or update. My table has a very high volume of inserts and updates, and this is causing a drop in performance.
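For reference, the initial text-backed table looks roughly like this (a sketch; the name test_text and the comma delimiter are just placeholders):

CREATE TABLE default.test_text (name_id STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;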
Is there a way to convert the table to Parquet after it has been created and the data has been inserted?
You can create a table over your data in HDFS stored as text, Avro, or whatever format.
Then you can create another table using:
CREATE TABLE x_parquet LIKE x_non_parquet STORED AS PARQUET;
You can then set compression to something like snappy or gzip:
SET PARQUET_COMPRESSION_CODEC=snappy;
Then you can pull the data from the non-Parquet table and insert it into the new Parquet-backed table:
INSERT INTO x_parquet SELECT * FROM x_non_parquet;
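One caveat, since the table in the question is partitioned: in Impala the partition columns have to be named in the INSERT, with a dynamic-partition insert listing them last in the SELECT. A rough sketch, reusing the column names from the question's example:

INSERT INTO x_parquet PARTITION (year, month, day)
SELECT name_id, year, month, day FROM x_non_parquet;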
Now, if you want to save space and avoid confusion, I'd automate this for any data ingestion and then delete the original non-Parquet data. This will help your queries run faster and make your data take up less space.
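As a sketch of that cleanup step (only once the Parquet copy has been verified; COMPUTE STATS is optional but helps the query planner):

COMPUTE STATS x_parquet;
DROP TABLE x_non_parquet;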