We currently generate a daily CSV export that we upload to an S3 bucket, into the following structure:
<report-name>
|-- reportDate-<date-stamp>
|   |-- part0.csv.gz
|   |-- part1.csv.gz
We want to be able to run reports partitioned by daily export.
According to this page, you can partition data in Redshift Spectrum by a key based on the source S3 folder where your Spectrum table sources its data. However, from the example, it looks like you need an ALTER statement for each partition:
alter table spectrum.sales_part
add partition(saledate='2008-01-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2008-01/';
alter table spectrum.sales_part
add partition(saledate='2008-02-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';
Is there any way to set the table up so that data is automatically partitioned by the folder it comes from, or do we need a daily job to ALTER the table to add that day's partition?
Amazon Redshift Spectrum supports table partitioning via the PARTITIONED BY clause of the CREATE EXTERNAL TABLE command. Only a subset of ALTER COLUMN actions is supported.
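As a minimal sketch for the daily-report layout in the question (the schema name, column names, and bucket path are placeholders, not from the original post):

create external table spectrum.daily_report(
  -- placeholder columns; replace with the real report columns
  col_a varchar(100),
  col_b integer
)
partitioned by (report_date date)
row format delimited
fields terminated by ','
stored as textfile
location 's3://my-bucket/<report-name>/'
-- only needed if the CSV files contain a header row
table properties ('skip.header.line.count'='1');

Gzipped text files (such as the part*.csv.gz files above) are read transparently. Note that declaring report_date under PARTITIONED BY does not by itself register any partitions; it only defines the partition key.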
Redshift Spectrum can offer more consistent performance, whereas querying in Athena can be slow during peak hours because it runs on pooled resources. Redshift Spectrum is more suitable for running large, complex queries, while Athena is better suited for simple, interactive queries.
A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously.
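As a hedged illustration of the COPY route (this loads a regular, local Redshift table rather than a Spectrum external table; the table name, bucket path, and IAM role ARN are placeholders):

copy daily_report
from 's3://my-bucket/<report-name>/reportDate-2017-12-01/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip
csv;

Pointing COPY at the folder prefix lets it load all of that day's part*.csv.gz files in parallel.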
If you are using CREATE EXTERNAL TABLE AS, you don't need to run ALTER TABLE ... ADD PARTITION; Amazon Redshift automatically registers new partitions in the external catalog.
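A minimal CREATE EXTERNAL TABLE AS sketch, assuming a local staging table named daily_report_stage (all names and the bucket path are placeholders; the partition column is listed last in the SELECT, matching the PARTITIONED BY clause):

create external table spectrum.daily_report_parquet
partitioned by (report_date)
stored as parquet
location 's3://my-bucket/reports/daily_report/'
as
select col_a, col_b, report_date
from daily_report_stage;

Each distinct report_date value produced by the query is written to its own partition folder under the LOCATION path and registered in the external catalog.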
Solution 1:
At most 20,000 partitions can be created per table, so you can run a one-time script that adds the partitions (up to that limit) for all of the future S3 partition folders in advance; a sketch of what that batch of statements looks like follows the example below.
For example, even if the folder s3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/ doesn't exist yet, you can still add a partition for it:
alter table spectrum.sales_part
add partition(saledate='2017-12-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/';
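In that spirit, the one-time script essentially emits a batch of statements like the following, one per future month (the dates are illustrative):

alter table spectrum.sales_part
add partition(saledate='2018-01-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-01/';

alter table spectrum.sales_part
add partition(saledate='2018-02-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-02/';

-- ...and so on for each future month you want pre-registered,
-- staying under the per-table partition limit.

Redshift Spectrum also lets you add several partitions in a single ALTER TABLE statement by repeating the PARTITION ... LOCATION pair, which keeps the generated script shorter.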
Solution 2:
https://aws.amazon.com/blogs/big-data/data-lake-ingestion-automatically-partition-hive-external-tables-with-aws/