Why not use a timestamp with an interleaved sort key?

I'm trying to understand the different types of sort keys in Amazon Redshift, and I came across a strange warning in the documentation that is not explained:

Important: Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.

And yet, in their own example, Amazon uses an interleaved key on a date column with good performance.

So my question is: what is the explanation for this warning, and should I take it seriously? More precisely, is there a problem with using an interleaved key on a timestamp column?

senior_citizen_ asked Feb 11 '19 13:02



2 Answers

I think it might have been explained later on when they describe issues around vacuuming/reindexing:

When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort key columns and uses that information for optimal interleaving of the sort key columns. As a table grows, the distribution of the values in the sort key columns can change, or skew, especially with date or timestamp columns. If the skew becomes too large, performance might be affected.

So if that is the only reason, it just means more maintenance: you will need to vacuum and reindex the table more often as it grows.
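To make the skew in the quoted passage concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not Redshift's actual algorithm: it assumes value ranges are split into equal-frequency buckets once, at load time, and that the boundaries stay fixed until the table is reindexed. The names `bucket_boundaries` and `bucket_of` and the sample data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def bucket_boundaries(values, n_buckets):
    """Pick equal-frequency boundaries from the values present at load time."""
    ordered = sorted(values)
    step = len(ordered) / n_buckets
    return [ordered[int(i * step)] for i in range(1, n_buckets)]


def bucket_of(value, boundaries):
    """Index of the bucket a value falls into, given fixed boundaries."""
    return bisect_right(boundaries, value)


# Hypothetical data: timestamps 0..999 at initial load, 1000..1999 appended later.
initial = list(range(1000))
appended = list(range(1000, 2000))

# Assumption: boundaries are computed once and not updated as rows arrive.
boundaries = bucket_boundaries(initial, n_buckets=4)
initial_counts = Counter(bucket_of(v, boundaries) for v in initial)
appended_counts = Counter(bucket_of(v, boundaries) for v in appended)

print(sorted(initial_counts.items()))   # [(0, 250), (1, 250), (2, 250), (3, 250)]
print(sorted(appended_counts.items()))  # [(3, 1000)]
```

In this model the initial load is spread evenly, but every later timestamp lands in the last bucket, so that column stops helping to separate new rows until the boundaries are recomputed.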

Łukasz Kamiński answered Oct 14 '22 21:10



From https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html

As you add rows to a sorted table that already contains data, the unsorted region grows, which has a significant effect on performance. The effect is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns.

The key point in the original quote is not that the data is a date or timestamp, but that it increases "monotonically", which in this context presumably means it increases sequentially, as an event timestamp or an ID number does.
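A small Python sketch of that distinction, using made-up data: it measures what share of newly appended values lie beyond everything already loaded. The function name `fraction_beyond_loaded_range` and both sample columns are hypothetical, and dates are encoded as plain day numbers for simplicity.

```python
def fraction_beyond_loaded_range(loaded, appended):
    """Share of appended values greater than every value loaded so far."""
    high_water_mark = max(loaded)
    beyond = sum(1 for v in appended if v > high_water_mark)
    return beyond / len(appended)


# Event dates: each new row is later than all previous rows (monotonic).
event_day_loaded = list(range(0, 100))
event_day_appended = list(range(100, 150))

# Birth dates: still dates, but new rows are scattered over the existing range.
birth_day_loaded = [(i * 37) % 100 for i in range(100)]
birth_day_appended = [(i * 53) % 100 for i in range(50)]

print(fraction_beyond_loaded_range(event_day_loaded, event_day_appended))  # 1.0
print(fraction_beyond_loaded_range(birth_day_loaded, birth_day_appended))  # 0.0
```

Both columns hold dates, but only the event date keeps pushing past the range seen at load time, which is the case the warning is about.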

Nathan Griffiths answered Oct 14 '22 19:10
