Redshift DISTKEY / SORTKEY

I have a very technical question about how Redshift handles DISTKEY and SORTKEY internally to meet the demands of both the storage tier and query execution. I have read this amazing post, which explains very well what each of these means for table design.

My question is this. Suppose I have a table A with three columns:

CREATE TABLE A (
    orderdate timestamp distkey,
    product_id varchar(50),
    product_name varchar(250)
) SORTKEY (product_id);

Now, we know that Redshift is a columnar database optimized for data warehousing. In my example it is clear that the data will probably be distributed across the slices of the compute nodes based on the DISTKEY, orderdate. But what happens to the columns product_id and product_name? Are they stored along with orderdate on the same slice, and then, when I execute a query, does Redshift use the zone maps based on my SORTKEY to locate the zone of the column that holds the data and retrieve it?

If Redshift is columnar, shouldn't each column be stored in a different way? Or does it really mean this: based on one wisely chosen column, all columns of a row are stored on the same slice as the DISTKEY value, and then, to guarantee performance, the user can focus the query on a specific zone to pull the required data? So, overall, I picture something like:

DISTKEY drives the storage tier, and SORTKEY drives query execution behavior.

Now, if I use a DISTKEY, my data is stored based on that particular column's order. If I then also use a SORTKEY, the order from my DISTKEY can't be changed or altered, so how does this work?

Sorry if I have this all wrong, but I need to understand how this architecture handles the data internally. Thanks so much.

Update

Based on @JoeHarris's post answering this question, I have tried to picture how the data might be stored.

The first level of distribution is my DISTKEY (dates are not a good choice, but I am keeping the same example), and then internally Redshift sorts by my SORTKEY, giving something like:

[Image: the asker's diagram of the data layout, not preserved]

Thanks for the feedback.

asked Oct 19 '18 02:10

Andres Urrego Angel

1 Answer

The DISTKEY distributes rows amongst slices.

In your example, all rows with a given orderdate would be located on the same slice. That means that all columns for those rows are on that slice too.

If two tables have the same DISTKEY, then all rows in both tables with the same value for the DISTKEY column will be located on the same slice.
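As a rough illustration of the two points above, here is a minimal Python sketch that assigns whole rows to slices by hashing the DISTKEY value. The slice count, the `slice_for` and `distribute` helpers, the `shipments` table, and the use of CRC32 are all assumptions made for the example; Redshift's real hash function and slice layout are internal and will differ.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical; the real count depends on node type and cluster size


def slice_for(distkey_value, num_slices=NUM_SLICES):
    """Map a DISTKEY value to a slice (CRC32 is a stand-in, not Redshift's hash)."""
    return zlib.crc32(str(distkey_value).encode("utf-8")) % num_slices


def distribute(rows, distkey):
    """Place each whole row (all of its columns) on the slice chosen by its DISTKEY."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[distkey])].append(row)
    return slices


orders = [
    {"orderdate": "2018-10-19", "product_id": "P1", "product_name": "Widget"},
    {"orderdate": "2018-10-19", "product_id": "P2", "product_name": "Gadget"},
    {"orderdate": "2018-10-20", "product_id": "P1", "product_name": "Widget"},
]
# A second, hypothetical table that shares the same DISTKEY column.
shipments = [
    {"orderdate": "2018-10-19", "carrier": "X"},
    {"orderdate": "2018-10-20", "carrier": "Y"},
]

order_slices = distribute(orders, "orderdate")
shipment_slices = distribute(shipments, "orderdate")
```

Because the slice is a function of the DISTKEY value alone, both orders dated 2018-10-19 land on one slice, and the shipment with that date lands on the same slice as well.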

By the way, dates and timestamps are not good candidates for DISTKEY because they are very rarely used in a JOIN. Unique identifiers like product_id would make a better DISTKEY. The general rule is to use a column that appears in the most/biggest JOINs.
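To see why the join column matters, the sketch below counts how many matching row pairs of a join end up on different slices under two DISTKEY choices. It is only a toy model: the `products` table, the slice count, the CRC32 stand-in hash, and the `cross_slice_matches` helper are assumptions for illustration, not a description of how Redshift plans or executes a join.

```python
import zlib

NUM_SLICES = 4  # hypothetical slice count


def slice_for(value, num_slices=NUM_SLICES):
    # CRC32 is a stand-in hash for illustration, not Redshift's internal function.
    return zlib.crc32(str(value).encode("utf-8")) % num_slices


def cross_slice_matches(left, left_distkey, right, right_distkey, join_col):
    """Count joined row pairs whose two rows sit on different slices."""
    moved = 0
    for left_row in left:
        for right_row in right:
            if left_row[join_col] != right_row[join_col]:
                continue
            if slice_for(left_row[left_distkey]) != slice_for(right_row[right_distkey]):
                moved += 1
    return moved


orders = [
    {"orderdate": f"2018-10-{day:02d}", "product_id": f"P{day % 5}"}
    for day in range(1, 21)
]
products = [{"product_id": f"P{i}", "product_name": f"Product {i}"} for i in range(5)]

# Both tables distributed on the join column: every match is slice-local.
colocated = cross_slice_matches(orders, "product_id", products, "product_id", "product_id")

# Orders distributed on orderdate instead: matches may land on different slices.
scattered = cross_slice_matches(orders, "orderdate", products, "product_id", "product_id")
```

With both tables distributed on product_id, `colocated` is always 0, since equal key values hash to the same slice. `scattered` depends on the hash, but nothing forces it to be 0, and any non-zero count stands for rows that would have to move between slices before they could be joined.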

The SORTKEY determines how the rows are ordered within the table. The rows stored on each slice are kept in SORTKEY order. Data for each column is stored in separate blocks (and most likely each column uses many blocks), but within those column blocks the rows are in the same order.

For example, if a table has three columns, it will occupy at least three blocks per slice (one for each column). Within those column blocks, the rows are all in the same order.
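The layout described above can be sketched in Python as follows. This is a simplified model, not Redshift's storage format: the `store_slice` and `read_row` helpers are hypothetical, and `rows_per_block=2` is an arbitrary stand-in for Redshift's fixed-size (1 MB) blocks.

```python
def store_slice(rows, sortkey, columns, rows_per_block=2):
    """Sort a slice's rows by SORTKEY, then split each column into its own blocks."""
    ordered = sorted(rows, key=lambda row: row[sortkey])
    blocks = {}
    for col in columns:
        values = [row[col] for row in ordered]
        blocks[col] = [
            values[i:i + rows_per_block]
            for i in range(0, len(values), rows_per_block)
        ]
    return blocks


def read_row(blocks, position, rows_per_block=2):
    """Rebuild one row by reading the same position from every column."""
    block_no, offset = divmod(position, rows_per_block)
    return {col: col_blocks[block_no][offset] for col, col_blocks in blocks.items()}


# Rows that already live on one slice, in arrival order.
slice_rows = [
    {"orderdate": "2018-10-19", "product_id": "P3", "product_name": "Gizmo"},
    {"orderdate": "2018-10-19", "product_id": "P1", "product_name": "Widget"},
    {"orderdate": "2018-10-20", "product_id": "P2", "product_name": "Gadget"},
]

blocks = store_slice(slice_rows, "product_id", ["orderdate", "product_id", "product_name"])
second_row = read_row(blocks, 1)
```

Here `blocks["product_id"]` is `[["P1", "P2"], ["P3"]]` and `blocks["product_name"]` is `[["Widget", "Gadget"], ["Gizmo"]]`. Each column has its own blocks, yet position 1 in every column belongs to the same row (P2, Gadget, 2018-10-20).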

Redshift also records the minimum and maximum value held in each block ("Zone Maps"), making it very easy to skip over blocks that cannot contain a desired value. This greatly improves performance because disk access is the slowest part of an operation.
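A minimal Python sketch of the block-skipping idea, assuming hypothetical helpers `build_zone_map` and `blocks_to_read` and tiny three-value blocks, shows why zone maps work best on the sorted column:

```python
def build_zone_map(blocks):
    """Record the (min, max) pair for each block of one column."""
    return [(min(block), max(block)) for block in blocks]


def blocks_to_read(zone_map, wanted):
    """Return indexes of blocks whose [min, max] range could contain `wanted`."""
    return [i for i, (low, high) in enumerate(zone_map) if low <= wanted <= high]


# product_id values stored in SORTKEY order: block ranges do not overlap.
sorted_blocks = [
    ["P01", "P02", "P03"],
    ["P04", "P05", "P06"],
    ["P07", "P08", "P09"],
]

# The same values stored in arbitrary order: block ranges overlap heavily.
unsorted_blocks = [
    ["P01", "P05", "P09"],
    ["P02", "P06", "P07"],
    ["P03", "P04", "P08"],
]

sorted_hits = blocks_to_read(build_zone_map(sorted_blocks), "P05")
unsorted_hits = blocks_to_read(build_zone_map(unsorted_blocks), "P05")
```

With the sorted layout only block 1 has to be read to find P05. With the unsorted layout every block's range covers P05, so none can be skipped.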

answered Nov 05 '22 00:11

John Rotenstein


Tags: amazon-web-services, database-design, amazon-redshift
