I'm trying to figure out the different types of sort keys in Amazon Redshift, and I encountered a strange warning here that is not explained:
Important: Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
And yet, in their own example, Amazon uses an interleaved key on a date column with good performance.
So, my question is: what's the explanation for this warning, and should I take it seriously? More precisely, is there a problem with using an interleaved key on a timestamp column?
Interleaved sort key. An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. If multiple queries use different columns for filters, then you can often improve performance for those queries by using an interleaved sort style.
Amazon Redshift supports two types of sort keys: compound sort keys and interleaved sort keys. Selecting the right kind requires knowledge of the queries that you plan to execute.
A table's distkey is the column that determines how its rows are distributed across nodes. Rows with the same value in this column are guaranteed to be on the same node. A table's sortkey is the column by which its rows are sorted within each node.
A Redshift sort key determines the order in which a table's rows are stored. Properly chosen sort keys improve query performance because they let the query optimizer filter out most of the data and read fewer blocks.
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. Queries are more efficient because they can skip entire blocks that fall outside the time range.
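To make the block-skipping idea concrete, here is a minimal Python sketch. It is not Redshift code: it assumes each block keeps only the minimum and maximum value it contains, and the names `build_zone_map` and `blocks_to_scan`, the block size, and the integer "timestamps" are all hypothetical.

```python
import random

def build_zone_map(values, block_size):
    """Split values into fixed-size blocks and record (min, max) per block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block)) for block in blocks]

def blocks_to_scan(zone_map, lo, hi):
    """Return indexes of blocks whose [min, max] range overlaps [lo, hi]."""
    return [i for i, (bmin, bmax) in enumerate(zone_map) if bmax >= lo and bmin <= hi]

# Hypothetical event times stored in sorted order (timestamp as the leading sort column).
timestamps = list(range(1000))
sorted_map = build_zone_map(timestamps, block_size=100)

# The same values stored in arbitrary order (no useful sort key).
shuffled = timestamps[:]
random.Random(0).shuffle(shuffled)
unsorted_map = build_zone_map(shuffled, block_size=100)

# A query for "recent" data: timestamps between 950 and 999.
recent_sorted = blocks_to_scan(sorted_map, 950, 999)
recent_unsorted = blocks_to_scan(unsorted_map, 950, 999)

print(len(recent_sorted), "block(s) read when sorted")
print(len(recent_unsorted), "block(s) read when unsorted")
```

With the sorted layout only the last block overlaps the range, so the other nine can be skipped. With the shuffled layout, recent values are scattered and far fewer blocks can be ruled out.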
An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order. To understand the impact of the chosen sort key on query performance, use the EXPLAIN command.
Interleaved sorts are most effective with highly selective queries that filter on one or more of the sort key columns in the WHERE clause, for example select c_name from customer where c_region = 'ASIA'.
You can specify either a compound or interleaved sort key. A compound sort key is more efficient when query predicates use a prefix, which is a subset of the sort key columns in order.
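The trade-off between the two styles can be sketched with a toy model. This is only an illustration under stated assumptions: the quoted text does not say how Redshift interleaves columns, so the sketch uses a Z-order (Morton) key as a stand-in, and the function names, the 16x16 table, and the block size are hypothetical.

```python
def morton_key(a, b, bits=4):
    """Interleave the bits of a and b so both columns get equal weight."""
    key = 0
    for i in range(bits):
        key |= ((a >> i) & 1) << (2 * i + 1)
        key |= ((b >> i) & 1) << (2 * i)
    return key

def zone_map(rows, block_size):
    """Record per-block (min, max) for each of the two columns."""
    zones = []
    for i in range(0, len(rows), block_size):
        block = rows[i:i + block_size]
        zones.append(tuple((min(r[c] for r in block), max(r[c] for r in block))
                           for c in range(2)))
    return zones

def blocks_scanned(zones, col, value):
    """Count blocks that cannot be skipped for the filter `column == value`."""
    return sum(1 for z in zones if z[col][0] <= value <= z[col][1])

# Toy table: every combination of two columns with values 0..15.
rows = [(a, b) for a in range(16) for b in range(16)]

compound = zone_map(sorted(rows), 16)  # sorted by (a, b)
interleaved = zone_map(sorted(rows, key=lambda r: morton_key(*r)), 16)

for name, zones in (("compound", compound), ("interleaved", interleaved)):
    print(name,
          "filter on a:", blocks_scanned(zones, 0, 3),
          "filter on b:", blocks_scanned(zones, 1, 3))
```

In this toy model the compound layout reads 1 of 16 blocks when filtering on the leading column but all 16 when filtering on the second column alone, while the interleaved layout reads 4 blocks in either case.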
I think it might have been explained later on when they describe issues around vacuuming/reindexing:
When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort key columns and uses that information for optimal interleaving of the sort key columns. As a table grows, the distribution of the values in the sort key columns can change, or skew, especially with date or timestamp columns. If the skew becomes too large, performance might be affected.
So if that is the only reason, then it just means you will have increased maintenance on the index.
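Here is a rough sketch of why a monotonically increasing column drifts away from the distribution seen at initial load. It assumes, purely for illustration, that the initial analysis splits the value range into equal-frequency buckets; the docs quoted above don't describe the real mechanism, and `bucket_boundaries` and `bucket_counts` are hypothetical names.

```python
from bisect import bisect_right
from collections import Counter

def bucket_boundaries(values, n_buckets):
    """Pick boundaries so the given values spread evenly over n_buckets."""
    ordered = sorted(values)
    step = len(ordered) // n_buckets
    return [ordered[i * step] for i in range(1, n_buckets)]

def bucket_counts(values, boundaries):
    """Count how many values land in each bucket."""
    counts = Counter(bisect_right(boundaries, v) for v in values)
    return [counts[i] for i in range(len(boundaries) + 1)]

# Initial load: timestamps 0..999, analyzed once to choose the boundaries.
initial_load = list(range(1000))
boundaries = bucket_boundaries(initial_load, 4)
before = bucket_counts(initial_load, boundaries)

# Later inserts: every new timestamp is larger than anything seen so far.
later_rows = list(range(1000, 2000))
after = bucket_counts(initial_load + later_rows, boundaries)

print("boundaries:", boundaries)
print("before:", before)
print("after: ", after)
```

Every new row lands in the last bucket, so the even split computed at load time becomes badly skewed until the boundaries are recomputed. That matches the "increased maintenance" reading of the warning.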
From https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html
As you add rows to a sorted table that already contains data, the unsorted region grows, which has a significant effect on performance. The effect is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns.
The key point in the original quote is not that the data is a date or timestamp, it's that it increases "monotonically", which in this context presumably means increasing sequentially, such as an event timestamp or an ID number.