 

Amazon Redshift: Backup & Restore best practices?

We have a set of tables in Redshift whose columns have the IDENTITY property, for sequence generation. During the testing phase there is a requirement to take a backup and restore, and this is a repetitive activity for each cycle of testing. We followed the processes below to take a backup and then restore, and faced the following issues:

  1. Traditional way: Created backup tables in another backup schema with CREATE TABLE XYZ_BKP AS SELECT * FROM XYZ. But doing that, we lost the IDENTITY and other attributes of the table. So during restore, if you create the table from the backup directly, you lose those properties, and you can't ALTER the table to add the IDENTITY constraint back.
  2. Traditional backup with a different restore method: This time we dropped and recreated the table from its DDL first, and then tried to perform an INSERT INTO from the backup. But it can't insert values into the IDENTITY columns. (Both failing approaches are sketched right after this list.)
  3. UNLOAD and COPY: We also tried Redshift utilities such as UNLOAD to take a backup of the table to S3 and then restore using COPY. It worked fine, but then we faced other issues: a. DATE fields with leading zeros didn't get extracted properly in the UNLOAD extract, e.g. the date '0001-01-01' was extracted as '1-01-01', which then fails during COPY as not a valid date. b. It throws several other errors during the restore (COPY), such as missing data for NOT NULL fields or invalid values for the INT data type. In other words, the UNLOAD and COPY commands together don't work in sync, and values change.
  4. Table restore from snapshot: I haven't tried this, but I understand AWS supports table-level restore now. Again, though, it's a tedious job to set this up individually for 500 tables, and you have to keep and track snapshots for a long time.
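
To make approaches 1 and 2 concrete, here is a minimal sketch of what we tried (the schema and table names are made up):

-- Approach 1: CTAS backup. The new table silently drops the IDENTITY,
-- default, and encoding attributes, and Redshift has no
-- ALTER TABLE ... ADD IDENTITY to put the property back afterwards.
CREATE TABLE bkp.xyz_bkp AS SELECT * FROM public.xyz;

-- Approach 2: recreate the table from its original DDL, then reload.
-- This fails on the sequence column, because INSERT cannot supply
-- explicit values for an IDENTITY column.
INSERT INTO public.xyz SELECT * FROM bkp.xyz_bkp;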

It would be very helpful if you could suggest the best possible way to backup and restore in my scenario, or the best practices organizations follow.

asked Feb 10 '18 by Genesis


People also ask

How many backup copies of Redshift does AWS maintain?

Automated backups are enabled by default when you create your Redshift cluster. Amazon will always attempt to maintain at least three copies of the data: the original and a replica on the compute nodes, plus a backup in Amazon S3 (Simple Storage Service).

Where do Redshift continuous backups get stored?

Data Durability: Amazon Redshift replicates your data within your data warehouse cluster and continuously backs up your data to Amazon S3, which is designed for eleven nines of durability. Amazon Redshift mirrors each drive's data to other nodes within your cluster.

Does Redshift have ETL?

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools.

How long are automatic Redshift backups retained by default?

The default retention period is one day, but you can modify it by using the Amazon Redshift console or programmatically by using the Amazon Redshift API or CLI. To disable automated snapshots, set the retention period to zero.
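
For example, a minimal AWS CLI sketch of changing the retention period (the cluster identifier is a placeholder):

# Extend automated snapshot retention to 7 days.
aws redshift modify-cluster \
    --cluster-identifier my-redshift-cluster \
    --automated-snapshot-retention-period 7

# Setting the period to 0 disables automated snapshots entirely.
aws redshift modify-cluster \
    --cluster-identifier my-redshift-cluster \
    --automated-snapshot-retention-period 0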


1 Answer

I would like to answer point by point, so this will be a bit long; please excuse me for that ;). In my opinion, the best option is UNLOAD to S3 and COPY back into the table from S3. (Here, S3 could also be replaced with EC2.)

  1. Traditional way: we prefer this when we need to do some data alteration and would like to dry-run our queries.
  2. Traditional backup with a different restore method: same issues as #1, so we don't use it.
  3. UNLOAD and COPY: this is the most convenient method, and even IDENTITY values can be retained, hence it is always the preferred method.

There are some problems listed in the question, but most of them are false or can be avoided by supplying the proper export/import parameters. I would like to provide all the necessary steps, with data, to prove my point that there are no issues with dates and timestamps during load and unload.

Here I'm covering most of the data types to prove my point.

create table sales(
salesid integer not null identity(1,1),
commission decimal(8,2),
saledate date,
description varchar(255),
created_at timestamp default sysdate,
updated_at timestamp);

Content of the flat file (sales-example.txt). Note it is pipe-delimited, which is COPY's default delimiter; the header row is skipped via IGNOREHEADER:

salesid,commission,saledate,description,created_at,updated_at
1|3.55|2018-12-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
2|6.55|2018-01-01|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
4|7.55|2018-02-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
5|3.55||Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
7|3.50|2018-10-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51

COPY command that will import the dates and timestamps, as well as the IDENTITY values (note the EXPLICIT_IDS option):

copy sales(salesid,commission,saledate,description,created_at,updated_at)
from 's3://****/de***/sales-example.txt'
credentials 'aws_access_key_id=************;aws_secret_access_key=***********'
IGNOREHEADER 1 EXPLICIT_IDS;

This will copy 5 records. I'm using PARALLEL OFF here to get the data into a single file to prove the point, though it's not required and should normally be avoided.

unload ('select salesid,commission,saledate,description,created_at,updated_at from sales')
to 's3://assortdw/development/sales-example-2.txt'
credentials 'aws_access_key_id=***********;aws_secret_access_key=***********'
parallel off;

And below is my content again, exactly the same as the import (the rows come back in a different order, since the unload query has no ORDER BY). Meaning: if I run the COPY command in any other environment, say dev or QA or somewhere else, I will get exactly the same records as in the Redshift cluster.

5|3.55||Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
1|3.55|2018-12-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
7|3.50|2018-10-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
2|6.55|2018-01-01|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
4|7.55|2018-02-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51

  4. Table restore from snapshot: This requires our networking/infrastructure group, hence we avoid it, though I'm less sure about this option. Other experts are most welcome to comment/share details about it. (A CLI sketch follows below.)
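
For completeness, a minimal AWS CLI sketch of a table-level snapshot restore (all identifiers are placeholders); scripting a loop over table names is how you would handle hundreds of tables:

# Restore a single table from an existing cluster snapshot into a new
# table name; repeat (or loop) per table.
aws redshift restore-table-from-cluster-snapshot \
    --cluster-identifier my-redshift-cluster \
    --snapshot-identifier my-snapshot-id \
    --source-database-name dev \
    --source-schema-name public \
    --source-table-name sales \
    --new-table-name sales_restored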

I hope this answers the question, as well as provides a starting point to discuss/summarize/conclude. All are most welcome to add their points.

answered Oct 16 '22 by Red Boy