Load Parquet files into Redshift

I have a bunch of Parquet files on S3 and I want to load them into Redshift in the most efficient way.

Each file is split into multiple chunks... What is the best way to load this data from S3 into Redshift?

Also, how do you create the target table definition in Redshift? Is there a way to infer the schema from Parquet and create the table programmatically? I believe this can be done with Redshift Spectrum, but I want to know whether it can also be done from a script.

Appreciate your help!

I am considering all AWS tools, such as Glue and Lambda, to do this in the best way in terms of performance, security and cost.

Asked Sep 05 '18 by Richard



1 Answer

The Amazon Redshift COPY command can natively load Parquet files by using the parameter:

FORMAT AS PARQUET

See: Amazon Redshift Can Now COPY from Parquet and ORC File Formats

The table must be pre-created; it cannot be created automatically.

Also note from COPY from Columnar Data Formats - Amazon Redshift:

COPY inserts values into the target table's columns in the same order as the columns occur in the columnar data files. The number of columns in the target table and the number of columns in the data file must match.
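
For illustration, here is a minimal sketch of what that looks like. The table name, column definitions, S3 path and IAM role ARN are placeholders; the columns must be declared in the same order as they appear in the Parquet files.

-- Pre-create the target table (hypothetical schema matching the Parquet files)
CREATE TABLE sales (
    sale_id    BIGINT,
    sale_date  DATE,
    amount     DECIMAL(10,2)
);

-- Load all Parquet chunks under the prefix in a single COPY
-- (bucket, prefix and role ARN are placeholders)
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;

Pointing COPY at the common prefix rather than at individual files lets Redshift load the chunks in parallel across slices.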

Answered Oct 17 '22 by John Rotenstein