Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Redshift UPDATE prohibitively slow

I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:

SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
  AND col2 = 12;

The above query returns in less than a second, because I have sortkeys on col1 and col2. There is only one row that meets this criteria, so the result set is just one row. However, if I run:

UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
  AND col2 = 12;

This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.

I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html, which talks about creating a temporary staging table to update the main table, but got the same results.

Any idea what is going on here?

like image 383
user37760 Avatar asked Sep 09 '14 18:09

user37760


1 Answers

You didn't mention what percentage of the table you're updating but it's important to note that an UPDATE in Redshift is a 2 step process:

  1. Each row that will be changed must be first marked for deletion
  2. Then a new version of the data must be written for each column in the table

If you have a large number of columns and/or are updating a large number of rows then this process can be very labor intensive for the database.

You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table and then dropping the existing table and renaming the new table. This has the added benefit of leaving you with a fully sorted table.

like image 81
Joe Harris Avatar answered Sep 28 '22 09:09

Joe Harris