Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can encoding compressions be applied to DISTKEY (without performance issues)

I am using Amazon Redshift to store a relationship table connected to a huge tables of logs.

The schema should look like:

CREATE TABLE public.my_table (
  id INT IDENTITY(1,1),
  identifier INTEGER      NOT NULL encode lzo DISTKEY,
  foreign_id VARCHAR(36)  NOT NULL encode runlength 
)
SORTKEY(foreign_id);

My question is: Can I apply encoding to the column used as DISTKEY (and by extensions SORTKEYs) without breaking the logic behind the repartition and the indexation?

Does it take into account the raw values without encoding to apply the DISTKEY and SORTKEY or rather the compressed values ?

like image 583
lelabo_m Avatar asked Jan 28 '23 09:01

lelabo_m


2 Answers

Yes, you can apply compression without fear of impacting the DISTKEY. Amazon Redshift will use the uncompressed values.

In fact, blocks are immediately decompressed when they are read from disk, so all operations are carried-out on uncompressed data.

Just remember the golden rules:

  • Use DISTKEY on the column that is often used in a JOIN
  • Use SORTKEY on columns that are often used in WHERE
  • Always compress data (less disk reads means faster access) — and the automatic compression normally finds the best encoding method
like image 87
John Rotenstein Avatar answered Apr 28 '23 05:04

John Rotenstein


After many days, I also manage to get an AWS staff response on this subject:

1) Can you apply encoding to column used as DISTKEY (and by extensions SORTKEYs) without breaking the logic behind the repartitions and the indexation?

You can apply encoding to the Distribution Key column which is also a Sort Key. However this is against our best practice recommendations as we do not advise applying encoding to a sort key column. Based on your question as you mention that the DIST KEY by extension could (is) also a SORT key then this would not be recommended. If the distribution key is not part of the sort key then you may encode it.

2) Does it take into account the raw values without encoding to apply the DISTKEY and SORTKEY or rather the compressed values?

The DISTKEY and SORTKEY algorithms are applied to the raw values. Compression is only at storage level which means that during query execution it is one the last steps when writing and one of the first steps before reading data. Looking at the example you have given where you are using run length encoding to encode the SORT KEY, we state specifically in our guide that "We do not recommend applying runlength encoding on any column that is designated as a sort key." This is due to the fact that range restricted scans might perform poorly if the sort key columns are compressed more highly. A rule of thump we recommend that the sort key should not be compressed as this can result in sort key skew. If you have time please have a look at our of our Redshift Deep Dive videos where we discuss compression in detail and actually mention this rule of thumb.

like image 45
lelabo_m Avatar answered Apr 28 '23 04:04

lelabo_m