
Why do Redshift COPY queries use (much) more disk space for tables with a sort key?

I have a large set of data on S3 in the form of a few hundred CSV files that are ~1.7 TB in total (uncompressed). I am trying to copy it to an empty table on a Redshift cluster.

The cluster is empty (no other tables) and has 10 dw2.large nodes. If I set a sort key on the table, the COPY command uses up all available disk space about 25% of the way through and aborts. If there's no sort key, the copy completes successfully and never uses more than 45% of the available disk space. This behavior is consistent whether or not I also set a distribution key.

I don't really know why this happens, or if it's expected. Has anyone seen this behavior? If so, do you have any suggestions for how to get around it? One idea would be to try importing each file individually, but I'd love to find a way to let Redshift deal with that part itself and do it all in one query.

Evan asked Oct 13 '14 04:10




1 Answer

I got an answer to this from the Redshift team. When the table has a sort key, the cluster needs free space of at least 2.5x the incoming data size to use as temporary space for the sort. One workaround is to upsize the cluster, copy the data, and then resize it back down.
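To make the 2.5x rule concrete, here is a rough Python sketch of the arithmetic. It is only an estimate under stated assumptions: the factor comes from the answer above, but the function names, the greedy batching, and the idea of treating free space as constant between batches are all hypothetical illustrations, not anything Redshift provides. It is also not clear from the answer whether "incoming data size" means the raw CSV size or the size once stored, so treat the numbers as a ballpark.

```python
from typing import Iterable, List, Tuple

# Factor quoted in the answer above; everything else here is an assumption.
SORT_SPACE_FACTOR = 2.5


def required_free_space(incoming_bytes: float,
                        factor: float = SORT_SPACE_FACTOR) -> float:
    """Free space the rule of thumb says a sorted load of this size needs."""
    return incoming_bytes * factor


def plan_batches(files: Iterable[Tuple[str, float]],
                 free_bytes: float,
                 factor: float = SORT_SPACE_FACTOR) -> List[List[str]]:
    """Greedily group (name, size) pairs so each batch fits the rule.

    Hypothetical helper: it assumes free_bytes stays constant, which is
    not true on a real cluster, where free space shrinks after each load.
    In practice you would re-measure free space before every batch.
    """
    budget = free_bytes / factor  # largest batch the rule allows
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0.0
    for name, size in files:
        if size > budget:
            raise ValueError(f"{name} alone exceeds the per-batch budget")
        if current and current_size + size > budget:
            batches.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

For the 1.7 TB in the question, `required_free_space(1.7e12)` gives 4.25e12 bytes, i.e. roughly 4.25 TB of free space if the rule applies to the uncompressed size. The batching helper corresponds to the "import the files in smaller groups" idea from the question; whether that actually avoids the problem on a given cluster is something to verify rather than assume.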

Evan answered Nov 11 '22 15:11