We currently generate a daily CSV export that we upload to an S3 bucket, into the following structure:
<report-name>
|-- reportDate-<date-stamp>
|   |-- part0.csv.gz
|   |-- part1.csv.gz
We want to be able to run reports partitioned by daily export.
According to this page, you can partition data in Redshift Spectrum by a key based on the source S3 folder where your Spectrum table sources its data. However, from the example, it looks like you need an ALTER statement for each partition:
alter table spectrum.sales_part
add partition(saledate='2008-01-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2008-01/';
alter table spectrum.sales_part
add partition(saledate='2008-02-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';
Is there any way to set the table up so that data is automatically partitioned by the folder it comes from, or do we need a daily job to ALTER the table to add that day's partition?
Amazon Redshift Spectrum supports table partitioning via the PARTITIONED BY clause of the CREATE EXTERNAL TABLE command. Only a subset of ALTER COLUMN actions is supported.
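As a minimal sketch for the daily-report layout in the question (the schema name, column names, and bucket path are placeholders, not from the original post):

create external table spectrum.daily_report(
  -- placeholder columns; replace with the real report columns
  col_a varchar(100),
  col_b integer
)
partitioned by (report_date date)
row format delimited
fields terminated by ','
stored as textfile
location 's3://my-bucket/<report-name>/'
-- only needed if the CSV files contain a header row
table properties ('skip.header.line.count'='1');

Gzipped text files (such as the part*.csv.gz files above) are read transparently. Note that declaring report_date under PARTITIONED BY does not by itself register any partitions; it only defines the partition key.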
Redshift Spectrum can offer more consistent performance, whereas querying in Athena can be slow during peak hours because it runs on pooled resources. Redshift Spectrum is more suitable for running large, complex queries, while Athena is better suited for simple, interactive queries.
A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously.
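As a hedged illustration of the COPY route (this loads a regular, local Redshift table rather than a Spectrum external table; the table name, bucket path, and IAM role ARN are placeholders):

copy daily_report
from 's3://my-bucket/<report-name>/reportDate-2017-12-01/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip
csv;

Pointing COPY at the folder prefix lets it load all of that day's part*.csv.gz files in parallel.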
If you are using CREATE EXTERNAL TABLE AS, you don't need to run ALTER TABLE ... ADD PARTITION; Amazon Redshift automatically registers new partitions in the external catalog.
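A minimal CREATE EXTERNAL TABLE AS sketch, assuming a local staging table named daily_report_stage (all names and the bucket path are placeholders; the partition column is listed last in the SELECT, matching the PARTITIONED BY clause):

create external table spectrum.daily_report_parquet
partitioned by (report_date)
stored as parquet
location 's3://my-bucket/reports/daily_report/'
as
select col_a, col_b, report_date
from daily_report_stage;

Each distinct report_date value produced by the query is written to its own partition folder under the LOCATION path and registered in the external catalog.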
Solution 1:
At most 20,000 partitions can be created per table, so you can run a one-time script that adds the partitions (up to that limit) for all of the future S3 partition folders in advance; a sketch of what that batch of statements looks like follows the example below.
For example, even if the folder s3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/ doesn't exist yet, you can still add a partition for it:
alter table spectrum.sales_part
add partition(saledate='2017-12-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/';
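In that spirit, the one-time script essentially emits a batch of statements like the following, one per future month (the dates are illustrative):

alter table spectrum.sales_part
add partition(saledate='2018-01-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-01/';

alter table spectrum.sales_part
add partition(saledate='2018-02-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-02/';

-- ...and so on for each future month you want pre-registered,
-- staying under the per-table partition limit.

Redshift Spectrum also lets you add several partitions in a single ALTER TABLE statement by repeating the PARTITION ... LOCATION pair, which keeps the generated script shorter.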
Solution 2:
https://aws.amazon.com/blogs/big-data/data-lake-ingestion-automatically-partition-hive-external-tables-with-aws/