Why do Redshift COPY queries use (much) more disk space for tables with a sort key?

Tags:

amazon-redshift

I have a large set of data on S3 in the form of a few hundred CSV files that are ~1.7 TB in total (uncompressed). I am trying to copy it to an empty table on a Redshift cluster.

The cluster is empty (no other tables) and has 10 dw2.large nodes. If I set a sort key on the table, the COPY command uses up all available disk space about 25% of the way through and aborts. If there's no sort key, the COPY completes successfully and never uses more than 45% of the available disk space. This behavior is consistent whether or not I also set a distribution key.

I don't really know why this happens, or if it's expected. Has anyone seen this behavior? If so, do you have any suggestions for how to get around it? One idea would be to try importing each file individually, but I'd love to find a way to let Redshift deal with that part itself and do it all in one query.

539

asked Oct 13 '14 04:10

Evan

1 Answer

I got an answer to this from the Redshift team. When the table has a sort key, the cluster needs free space of at least 2.5x the incoming data size to use as temporary space for the sort. As a workaround, you can upsize your cluster, copy the data, and then resize it back down.
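To make the 2.5x figure concrete, here is a rough back-of-the-envelope sketch using the numbers from the question. It rests on assumptions that the Redshift team did not confirm: that "incoming data size" means the on-disk footprint of the load without a sort key (about 45% of the disk here), that the 2.5x is the total free space needed on an otherwise empty cluster, and that data spreads evenly across nodes of the same type. The function names are made up for illustration.

```python
import math

def sort_temp_fraction(loaded_fraction, factor=2.5):
    """Free space needed for the sorted load, as a fraction of current total disk.

    loaded_fraction: share of the disk the data occupies when loaded with
    no sort key (assumed to stand in for the "incoming data size").
    factor: the 2.5x rule of thumb quoted in the answer.
    """
    return factor * loaded_fraction

def min_nodes(current_nodes, loaded_fraction, factor=2.5):
    """Smallest node count (same node type) whose total disk covers that need."""
    return math.ceil(current_nodes * sort_temp_fraction(loaded_fraction, factor))

needed = sort_temp_fraction(0.45)
print(f"free space needed: {needed:.1%} of the current disk")  # 112.5%
print("minimum nodes for the sorted load:", min_nodes(10, 0.45))  # 12
```

Under these assumptions the sorted load needs more free space than the whole 10-node cluster has, which is consistent with the COPY aborting. Treat the node count as a lower bound for planning the temporary resize, not as a guarantee.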

123

answered Nov 11 '22 15:11

Evan
