
Offloading data files from Amazon Redshift to Amazon S3 in Parquet format

I would like to unload data files from Amazon Redshift to Amazon S3 in Apache Parquet format in order to query the files on S3 using Redshift Spectrum. I have searched everywhere, but I couldn't find anything about how to offload files from Amazon Redshift to S3 in Parquet format. Is this feature not supported yet, or was I just unable to find any documentation about it? Could somebody who has worked on it shed some light on this? Thank you.

Teja asked Feb 07 '18


People also ask

How to import parquet data from Amazon S3 to Amazon Redshift?

Your data first needs to be in an Amazon S3 bucket; only from there can you import it into your Amazon Redshift data warehouse using Amazon Redshift Data Pipeline. The process divides into two steps: create an Amazon S3 bucket and upload the Parquet file to it, then load that file into Redshift.
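
A minimal sketch of the loading step, assuming a hypothetical table, bucket, and IAM role (substitute your own):

copy my_table
from 's3://mybucket/data/myfile.parquet'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
format as parquet;

Note that when copying from Parquet, the target table's columns must match the Parquet file's columns in number and order.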

How do I unload data from Amazon Redshift to S3?

Unloading Data to Amazon S3. Amazon Redshift splits the results of a select statement across a set of files, one or more files per node slice, to simplify parallel reloading of the data. Alternatively, you can specify that UNLOAD should write the results serially to one or more files by adding the PARALLEL OFF option.
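
As an illustration, a serial unload to delimited text might look like this (bucket and role names are placeholders):

unload ('select * from my_table')
to 's3://mybucket/exports/my_table_'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
parallel off;

With PARALLEL OFF, Redshift writes the result serially, starting a new file only when the current one reaches the maximum file size.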

Can redshift read parquet data?

Amazon Redshift Can Now COPY from Parquet and ORC File Formats. You can now COPY Apache Parquet and Apache ORC file formats from Amazon S3 to your Amazon Redshift cluster. Apache Parquet and ORC are columnar data formats that allow users to store their data more efficiently and cost-effectively.

What is the use of UNLOAD command in Amazon Redshift?

Amazon Redshift supports the UNLOAD command, which takes the result of a query and stores the data in Amazon S3. It works in the opposite direction to the COPY command, which grabs data from an Amazon S3 bucket and puts it into an Amazon Redshift table.


2 Answers

Redshift UNLOAD to Parquet file format has been supported since December 2019:

UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
FORMAT PARQUET

It is mentioned in the Redshift feature announcements, documented in the UNLOAD command reference, and an example is provided in the UNLOAD examples documentation.

Excerpt of the official documentation:

The following example unloads the LINEITEM table in Parquet format, partitioned by the l_shipdate column.

unload ('select * from lineitem')
to 's3://mybucket/lineitem/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
PARQUET
PARTITION BY (l_shipdate);

Assuming four slices, the resulting Parquet files are dynamically partitioned into various folders.

s3://mybucket/lineitem/l_shipdate=1992-01-02/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
s3://mybucket/lineitem/l_shipdate=1992-01-03/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
s3://mybucket/lineitem/l_shipdate=1992-01-04/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
secdatabase answered Oct 18 '22


You can't do this. Redshift doesn't know about Parquet (although you can read Parquet files through the Spectrum abstraction).

You can UNLOAD to text files. They can be encrypted or compressed, but they are only ever flat text files.
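
For illustration, a text unload using the long-standing DELIMITER and GZIP options (bucket and role names are placeholders):

unload ('select * from my_table')
to 's3://mybucket/exports/my_table_'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|'
gzip;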


Looks like this is now supported:

https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/

Kirk Broadhurst answered Oct 19 '22