I am using Amazon Redshift to store a relationship table connected to a huge tables of logs.
The schema should look like:
CREATE TABLE public.my_table (
id INT IDENTITY(1,1),
identifier INTEGER NOT NULL encode lzo DISTKEY,
foreign_id VARCHAR(36) NOT NULL encode runlength
)
SORTKEY(foreign_id);
My question is: Can I apply encoding to the column used as DISTKEY (and by extensions SORTKEYs) without breaking the logic behind the repartition and the indexation?
Does it take into account the raw values without encoding to apply the DISTKEY and SORTKEY or rather the compressed values ?
Yes, you can apply compression without fear of impacting the DISTKEY
. Amazon Redshift will use the uncompressed values.
In fact, blocks are immediately decompressed when they are read from disk, so all operations are carried-out on uncompressed data.
Just remember the golden rules:
DISTKEY
on the column that is often used in a JOIN
SORTKEY
on columns that are often used in WHERE
After many days, I also manage to get an AWS staff response on this subject:
1) Can you apply encoding to column used as DISTKEY (and by extensions SORTKEYs) without breaking the logic behind the repartitions and the indexation?
You can apply encoding to the Distribution Key column which is also a Sort Key. However this is against our best practice recommendations as we do not advise applying encoding to a sort key column. Based on your question as you mention that the DIST KEY by extension could (is) also a SORT key then this would not be recommended. If the distribution key is not part of the sort key then you may encode it.
2) Does it take into account the raw values without encoding to apply the DISTKEY and SORTKEY or rather the compressed values?
The DISTKEY and SORTKEY algorithms are applied to the raw values. Compression is only at storage level which means that during query execution it is one the last steps when writing and one of the first steps before reading data. Looking at the example you have given where you are using run length encoding to encode the SORT KEY, we state specifically in our guide that "We do not recommend applying runlength encoding on any column that is designated as a sort key." This is due to the fact that range restricted scans might perform poorly if the sort key columns are compressed more highly. A rule of thump we recommend that the sort key should not be compressed as this can result in sort key skew. If you have time please have a look at our of our Redshift Deep Dive videos where we discuss compression in detail and actually mention this rule of thumb.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With