 

Redshift Spectrum: Automatically partition tables by date/folder

We currently generate a daily CSV export that we upload to an S3 bucket, into the following structure:

<report-name>
|-- reportDate-<date-stamp>
    |-- part0.csv.gz
    |-- part1.csv.gz

We want to be able to run reports partitioned by daily export.

According to this page, you can partition data in Redshift Spectrum by a key based on the source S3 folder from which your Spectrum table reads its data. However, from the example, it looks like you need a separate ALTER statement for each partition:

alter table spectrum.sales_part
add partition(saledate='2008-01-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2008-01/';

alter table spectrum.sales_part
add partition(saledate='2008-02-01') 
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';

Is there any way to set the table up so that data is automatically partitioned by the folder it comes from, or do we need a daily job to ALTER the table to add that day's partition?

asked Nov 08 '17 by GoatInTheMachine

People also ask

Does redshift support table partitioning?

Amazon Redshift Spectrum supports table partitioning using the CREATE EXTERNAL TABLE command. Only a subset of ALTER COLUMN actions is supported.
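
For context, the partition key is declared with PARTITIONED BY when the external table is created. A minimal sketch, with illustrative schema, column, and bucket names:

-- Illustrative sketch: the partition column (saledate) is declared
-- separately from the data columns, and the location points at the
-- folder that contains the saledate=... subfolders.
create external table spectrum.sales_part(
    salesid integer,
    pricepaid decimal(8,2)
)
partitioned by (saledate date)
row format delimited fields terminated by ','
stored as textfile
location 's3://bucket/tickit/spectrum/sales_partition/';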

Is redshift Spectrum faster than Athena?

Redshift Spectrum tends to deliver more consistent performance, while querying in Athena can be slow during peak hours because Athena runs on pooled resources. Redshift Spectrum is more suitable for running large, complex queries, while Athena is better suited to simple, interactive queries.

What is the most efficient and fastest way to load data into redshift?

A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously.
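
A minimal sketch of loading gzipped CSV parts like those in the question with COPY; the table name, bucket path, and IAM role are placeholders:

-- Illustrative only: table, S3 path, and IAM role are placeholders.
-- COPY reads all part files under the prefix in parallel.
copy report_table
from 's3://bucket/report-name/reportDate-2017-11-08/'
iam_role 'arn:aws:iam::123456789012:role/myRedshiftRole'
csv gzip;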

Can we apply partitioning on external tables redshift?

If you are using CREATE EXTERNAL TABLE AS, you don't need to run ALTER TABLE ... ADD PARTITION . Amazon Redshift automatically registers new partitions in the external catalog.
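
A sketch of what that looks like, assuming a local sales table to write out; the names are illustrative. Note that the partition column must come last in the SELECT list:

-- Illustrative sketch: CREATE EXTERNAL TABLE AS writes partitioned
-- output to S3 and registers each partition automatically,
-- so no ALTER TABLE ... ADD PARTITION is needed.
create external table spectrum.sales_by_date
partitioned by (saledate)
stored as parquet
location 's3://bucket/tickit/spectrum/sales_by_date/'
as select salesid, pricepaid, saledate from sales;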


1 Answer

Solution 1:

At most 20,000 partitions can be created per table. You can write a one-time script that pre-adds partitions (up to that 20,000 limit) for all future S3 partition folders; a batch sketch follows the example below.

For example:

Even if the folder s3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/ doesn't exist yet, you can still add a partition for it:

alter table spectrum.sales_part
add partition(saledate='2017-12-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/';
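
To pre-register many future folders in one pass, several partitions can be added in a single statement (ALTER TABLE ... ADD accepts multiple PARTITION clauses, and IF NOT EXISTS makes the script safe to re-run). The dates below are illustrative:

-- Illustrative one-time batch: pre-adds partitions for future months.
alter table spectrum.sales_part add if not exists
partition(saledate='2018-01-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-01/'
partition(saledate='2018-02-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-02/';
-- ...continue generating clauses up to the 20,000-partition limit as needed.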

Solution 2:

The AWS Big Data blog post below describes automating partition registration for Hive-compatible external tables as new data lands in S3:

https://aws.amazon.com/blogs/big-data/data-lake-ingestion-automatically-partition-hive-external-tables-with-aws/

answered Sep 23 '22 by Sumit Saurabh