I would like to unload data files from Amazon Redshift to Amazon S3 in Apache Parquet format in order to query the files on S3 using Redshift Spectrum. I have searched everywhere, but I couldn't find anything about how to unload files from Amazon Redshift to S3 in Parquet format. Is this feature not supported yet, or was I just unable to find any documentation about it? Could somebody who has worked on this shed some light on it? Thank you.
You need to have your data inside your Amazon S3 bucket first; only from there can you import it into your Amazon Redshift data warehouse, for example using AWS Data Pipeline. This method is broadly divided into two steps: first create an Amazon S3 bucket and upload the Parquet files to it, then load those files into Redshift.
Unloading data to Amazon S3: Amazon Redshift splits the results of a SELECT statement across a set of files, one or more files per node slice, to simplify parallel reloading of the data. Alternatively, you can specify that UNLOAD should write the results serially to one or more files by adding the PARALLEL OFF option.
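For illustration, a serial unload might look like the sketch below (the table name, bucket, and IAM role are placeholders, not values from the question):
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/unload/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
PARALLEL OFF      -- write the files serially instead of one file per node slice
ALLOWOVERWRITE;   -- replace any existing files with the same prefix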
Amazon Redshift Can Now COPY from Parquet and ORC File Formats. You can now COPY Apache Parquet and Apache ORC file formats from Amazon S3 to your Amazon Redshift cluster. Apache Parquet and ORC are columnar data formats that allow users to store their data more efficiently and cost-effectively.
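For reference, that reverse direction (loading Parquet from S3 into an existing Redshift table) is a single COPY statement; the table name, bucket path, and IAM role below are placeholders:
COPY my_table
FROM 's3://my-bucket/parquet-data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;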
Amazon Redshift supports the UNLOAD command, which takes the result of a query and stores the data in Amazon S3. It works in the opposite direction to the COPY command, which grabs data from an Amazon S3 bucket and puts it into an Amazon Redshift table.
Redshift UNLOAD to the Parquet file format has been supported since December 2019:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
FORMAT PARQUET
It is mentioned in the Redshift Features announcement, the UNLOAD documentation has been updated accordingly, and an example is provided in the UNLOAD examples documentation.
Excerpt from the official documentation:
The following example unloads the LINEITEM table in Parquet format, partitioned by the l_shipdate column.
unload ('select * from lineitem')
to 's3://mybucket/lineitem/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
PARQUET
PARTITION BY (l_shipdate);
Assuming four slices, the resulting Parquet files are dynamically partitioned into various folders.
s3://mybucket/lineitem/l_shipdate=1992-01-02/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
s3://mybucket/lineitem/l_shipdate=1992-01-03/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
s3://mybucket/lineitem/l_shipdate=1992-01-04/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
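Since the question mentions Redshift Spectrum, a rough sketch of querying these unloaded files is to define an external table over the same S3 prefix. The schema name and the shortened column list below are assumptions (the real LINEITEM table has more columns), and the external schema is assumed to already exist:
CREATE EXTERNAL TABLE spectrum_schema.lineitem_parquet (
    l_orderkey  BIGINT,
    l_partkey   BIGINT,
    l_quantity  DECIMAL(12,2)
)
PARTITIONED BY (l_shipdate DATE)
STORED AS PARQUET
LOCATION 's3://mybucket/lineitem/';

-- Each partition written by UNLOAD has to be registered before it shows up in queries:
ALTER TABLE spectrum_schema.lineitem_parquet
ADD PARTITION (l_shipdate = '1992-01-02')
LOCATION 's3://mybucket/lineitem/l_shipdate=1992-01-02/';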
You can't do this. Redshift doesn't know about Parquet (although you can read Parquet files through the Spectrum abstraction).
You can UNLOAD to text files. They can be encrypted or compressed, but they are only ever flat text files.
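A minimal sketch of such a text unload (identifiers are placeholders), producing compressed, pipe-delimited flat files:
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP;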
Looks like this is now supported:
https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/