
INSERT INTO table SELECT Redshift super slow

We have a large table that we need to DEEP COPY. Since we don't have enough free disk space to do it in one statement, I've tried to do it in batches. But the batches seem to run very slowly.

I'm running something like this:

   INSERT INTO new_table 
   SELECT * FROM old_table 
    WHERE creation_date between '2018-01-01' AND '2018-02-01'

The SELECT on its own returns only a small number of rows (~1K):

   SELECT * FROM old_table 
    WHERE creation_date between '2018-01-01' AND '2018-02-01'

  • The INSERT query takes around 50 minutes to complete.

  • The old_table has ~286M rows and ~400 columns

  • creation_date is one of the SORTKEYs

Explain plan looks like:

XN Seq Scan on old_table  (cost=0.00..4543811.52 rows=178152 width=136883)
      Filter: ((creation_date <= '2018-02-01'::date) AND (creation_date >= '2018-01-01'::date))

My question is:

  • What could cause the INSERT query to take this long?
AlexV asked May 27 '18 06:05



1 Answer

In my opinion, there are two possibilities, though it would be great if you could add more details to your question.

  1. As @John stated in the comments, the SORTKEY matters a lot in Redshift. Is creation_date a sort key?
  2. Did you do a lot of updates on old_table? If so, you must vacuum first: run VACUUM DELETE ONLY old_table, then run the SELECT queries.
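
The second point can be sketched roughly as follows. This is only a sketch: it assumes the table is literally named old_table and that you are allowed to vacuum it. The first query reads the SVV_TABLE_INFO system view to see how much of the table is unsorted and how stale its statistics are, which may hint at whether a vacuum is worth the time.

```sql
-- Inspect the table before vacuuming.
-- tbl_rows also counts rows that are marked for deletion but not yet vacuumed;
-- unsorted and stats_off are percentages.
SELECT "table", tbl_rows, unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'old_table';

-- Reclaim the space held by rows that earlier UPDATE/DELETE statements
-- marked for deletion, without re-sorting the table.
VACUUM DELETE ONLY old_table;

-- Refresh the planner statistics afterwards.
ANALYZE old_table;
```

After that, rerun the batch SELECT/INSERT and compare the timing.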

Another option is to go through S3, though I'm not sure whether you want to do that.
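
A minimal sketch of the S3 route, assuming a bucket and an IAM role that Redshift can use. Both `my-example-bucket` and the role ARN below are placeholders, not real values, and the date range is just the one from the question. Parquet is used here only to avoid delimiter problems with ~400 columns; if some column types do not work with Parquet, a delimited format would be the alternative.

```sql
-- Export one batch of old_table to S3.
-- Quotes inside the UNLOAD query text are doubled.
-- The bucket name and role ARN are hypothetical placeholders.
UNLOAD ('SELECT * FROM old_table WHERE creation_date BETWEEN ''2018-01-01'' AND ''2018-02-01''')
TO 's3://my-example-bucket/old_table/2018-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;

-- Load the exported files into the new table.
-- Assumes new_table has the same columns, in the same order, as old_table.
COPY new_table
FROM 's3://my-example-bucket/old_table/2018-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
```

The exported data then sits in S3 rather than on the cluster's disks, which is the point when free disk space is the constraint.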

Red Boy answered Oct 14 '22 00:10