Redshift documentation identifies time-series tables as a best practice: http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html
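For reference, the pattern that page describes is roughly the following; this is only a minimal sketch with made-up table and column names:

```sql
-- One table per month, all with the same structure, combined behind a UNION ALL view.
CREATE TABLE events_2017_01 (
    event_time TIMESTAMP SORTKEY,
    user_id    INTEGER   DISTKEY,
    payload    VARCHAR(1000)
);

-- LIKE copies the column definitions, including distribution style and sort key.
CREATE TABLE events_2017_02 (LIKE events_2017_01);

CREATE VIEW events AS
    SELECT * FROM events_2017_01
    UNION ALL
    SELECT * FROM events_2017_02;

-- Retiring an old month is then a DROP TABLE (after recreating the view without it)
-- instead of a large DELETE followed by a VACUUM.
```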
However, it doesn't address any of the following issues:
My situation:
My proposed solution:
EDITS: added notes identifying which issues have been answered, and added some detail to the proposed solution.
Your proposed process sounds quite good! While I can't answer all your questions, here is some information:
Any method of writing to the union-all view and having redshift direct those inserts to the correct underlying tables?
Views are read-only. It is not possible to write to a view, nor to insert data and expect Redshift to route it to the appropriate underlying table (e.g. a specific table for the given day).
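So the loading process has to target the correct underlying table directly, for example (continuing the hypothetical tables above, with a placeholder S3 bucket and IAM role):

```sql
-- Writes go to the specific period table, not to the view.
INSERT INTO events_2017_02 (event_time, user_id, payload)
VALUES ('2017-02-14 09:30:00', 42, 'example row');

-- Bulk loads work the same way: COPY straight into the period table.
COPY events_2017_02
FROM 's3://my-bucket/2017/02/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;
```

The logic that picks the right table for the current period (and creates a new one when the period rolls over) has to live in your load process, not in Redshift.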
Any way to enable redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria?
Redshift will not exclude specific tables from the query, but it will avoid reading particular disk blocks through the use of Zone Maps. Each block of data written to disk is associated with a specific table and column. The block has a Zone Map, which indicates the minimum and maximum values of that field stored within the block.
If a query includes a WHERE clause, Redshift can skip blocks that do not contain relevant data. This is particularly powerful when used on the SORTKEY column, since similar ranges of data are grouped together.

Given that you are using a date as the SORTKEY, Redshift will read very few disk blocks if the query includes a WHERE clause based on that column. This is very similar to the idea of skipping tables, but it actually skips reading disk blocks.
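As a concrete illustration (again using the hypothetical tables and columns from above), a range predicate on the sort key lets Redshift consult the zone maps and skip most blocks, even though the query goes through the UNION ALL view:

```sql
-- event_time is the SORTKEY, so blocks whose min/max values fall outside
-- the range below are skipped via zone maps in every underlying table.
SELECT user_id, COUNT(*) AS events_in_week
FROM events
WHERE event_time BETWEEN '2017-02-01' AND '2017-02-07'
GROUP BY user_id;
```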