What does it mean to have multiple sortkey columns?

Redshift allows designating multiple columns as SORTKEY columns, but most of the best-practices documentation is written as if there were only a single SORTKEY.

If I create a table with SORTKEY (COL1, COL2), does that mean that all columns are stored sorted by COL1, then COL2? Or maybe, since it is a columnar store, each column gets stored in a different order? I.e. COL1 in COL1 order, COL2 in COL2 order, and the other columns unordered?

My situation is that I have a table with (among others) a type_id and a timestamp column. Data arrives roughly in timestamp order. Most queries are joined against / restricted by both type_id and timestamp. Usually the type_id clauses are more specific, meaning a much larger percentage of rows can be excluded by looking at the type_id clause than by looking at the timestamp clause. type_id is the DISTKEY for this reason. I'm trying to understand the pros and cons of SORTKEY (type_id), SORTKEY (stamp), SORTKEY (type_id,stamp), SORTKEY (stamp,type_id).
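For concreteness, here is a rough sketch of the DDL I'm choosing between (the table name and the extra columns are just placeholders; only type_id, stamp, and the DISTKEY are real):

-- Sketch only: names/types other than type_id and stamp are made up.
CREATE TABLE events (
    event_id  BIGINT,
    type_id   INTEGER,
    stamp     TIMESTAMP,
    payload   VARCHAR(256)
)
DISTKEY (type_id)
-- The candidate sort keys I'm weighing:
--   SORTKEY (type_id)
--   SORTKEY (stamp)
--   SORTKEY (stamp, type_id)
SORTKEY (type_id, stamp);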

Thanks.

Asked Jun 14 '13 by Lorrin


People also ask

What is a Sortkey?

A sort key is a field in your table that determines the order in which the data is physically stored in the database. If you have a table of sales and you select the purchase time as the sort key, the data will be ordered from oldest to newest purchase.

What is multi column sorting?

Multi-column sorting allows you to sort the fields one after the other. For example, if a user has three different fields rendered in the pivot grid, it is possible to sort like: OrderBy(field1).ThenBy(field2).

What are DISTKEY and SORTKEY in Redshift?

A table's distkey is the column on which it's distributed to each node. Rows with the same value in this column are guaranteed to be on the same node. A table's sortkey is the column by which it's sorted within each node.


2 Answers

If you declare SORTKEY (COL1, COL2), the rows in all columns will be stored sorted by COL1, then COL2, as if ORDER BY COL1, COL2 had been applied.
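A minimal sketch of what that means in practice (table and column names are made up):

-- Every column of t is physically stored in the same (col1, col2) order.
CREATE TABLE t (
    col1 INTEGER,
    col2 TIMESTAMP,
    col3 VARCHAR(32)
)
SORTKEY (col1, col2);

-- Conceptually, the on-disk row order matches:
--   SELECT * FROM t ORDER BY col1, col2;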

If you are using the SORTKEY to speed up a JOIN, AFAIU the order doesn't matter as long as you use the same SORTKEY on the tables being joined, because what happens then is a merge join.
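For example (a sketch with made-up table names; as far as I know Redshift can only pick a merge join when both tables are distributed and sorted on the join column):

-- Both tables share the same DISTKEY and SORTKEY on the join column,
-- which lets the planner choose a merge join instead of a hash join.
CREATE TABLE orders (
    customer_id INTEGER,
    order_total DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (customer_id);

CREATE TABLE customers (
    customer_id INTEGER,
    name        VARCHAR(64)
)
DISTKEY (customer_id)
SORTKEY (customer_id);

SELECT c.name, SUM(o.order_total)
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.name;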

If COL1 is highly selective, like your type_id, it means only a small number of rows share the same type_id. So although you can add another column to the SORTKEY, its utility is limited because most of the row elimination has already happened.

If COL1 is not highly selective, like your stamp (which is a bit weird btw; I would have expected it to be more selective than type_id? Anyways..), filtering by it won't eliminate that many rows, so it makes more sense to declare a second sort key column. However, this is less efficient than the other way around, since eliminating rows earlier is cheaper. If you sometimes filter by stamp but not by type_id, it may still make sense to do it this way, though.
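To illustrate that last point with the hypothetical events table sketched in the question: with SORTKEY (stamp, type_id), a query that filters only on the timestamp can still skip blocks, whereas with SORTKEY (type_id, stamp) the matching timestamps are scattered across every type_id range.

-- Can prune blocks via a leading stamp sort key even though type_id is absent.
SELECT COUNT(*)
FROM events
WHERE stamp >= '2013-06-01' AND stamp < '2013-07-01';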

Answered Nov 26 '22 by Enno Shioji


We are also using Redshift, with about 2 billion records (+20 million every day), and I have to say: the less selective a sort key column is, the earlier it should appear in the sort key list.

In our case (and please be advised to analyze how you use/query your own data) we used the timestamp as the first sort key. The problem is that we record about 200 rows per second, so each 1 MB block covers only a few seconds and contains every type of data for that interval. In other words, even though the timestamp is highly selective, once we have filtered on it we cannot really filter any further, because every remaining block contains all kinds of data.

Recently we reversed the order of the sort keys. The first column now has about 15 distinct values, the second about 30, and so on, with the timestamp last. A single block still only covers a few seconds, though.
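As a sketch (the column names are placeholders; only the cardinalities and the row rate are from our setup), the reversed ordering looks roughly like this:

-- Low-cardinality, frequently filtered columns lead the sort key;
-- the high-cardinality timestamp now comes last.
CREATE TABLE measurements (
    source_type  SMALLINT,          -- ~15 distinct values
    channel      SMALLINT,          -- ~30 distinct values
    recorded_at  TIMESTAMP,         -- ~200 rows per second
    value        DOUBLE PRECISION
)
SORTKEY (source_type, channel, recorded_at);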

Since we use the first two sort keys as filters very frequently, the results look like this. Old solution: with a year of data, selecting one month drops 91% of the blocks, but after that every remaining block has to be opened, even though we want to filter further.

New solution: about 14/15 of the blocks are dropped in the first step regardless of the date range, then about 95% of the remaining ones by the second column, and the timestamp still drops 91% of what is left.
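A sketch of the typical filter pattern (names match the hypothetical table above):

-- The first two sort key columns prune most blocks before the
-- timestamp range is even considered.
SELECT AVG(value)
FROM measurements
WHERE source_type = 3
  AND channel = 12
  AND recorded_at >= '2022-01-01' AND recorded_at < '2023-01-01';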

We tested this thoroughly with two 800-million-record tables that were identical except for the order of the sort keys. The longer the time period in the WHERE clause, the better the results. The difference was even more significant in the case of joins, obviously.

So my suggestion is: know your database and what kinds of queries you run frequently, because the most selective column might not be the best first sort key. Just as Enno Shioji said, it all depends on what you are filtering by.

Answered Nov 26 '22 by user318581