 

Amazon Redshift: Backup & Restore best practices?

We have a set of tables in Redshift whose columns have the IDENTITY property, for sequence generation. During the testing phase there is a requirement to take a backup and restore, and this is a repetitive activity for each cycle of testing. We followed the processes below to take a backup and then restore, and faced the following issues:

  1. Traditional way: Created backup tables in another backup schema with CREATE TABLE XYZ_BKP AS SELECT * FROM XYZ. But doing that, we lost the IDENTITY and other attributes of the table. So during restore, if you create the table from the backup directly, you lose those properties, and you can't ALTER the table to add the IDENTITY constraint back.
  2. Traditional backup with a different restore method: This time we dropped and recreated the table from its DDL first, and then tried to perform an INSERT INTO from the backup. But it can't insert values into the IDENTITY columns. (Both failing approaches are sketched right after this list.)
  3. UNLOAD and COPY: We also tried Redshift utilities such as UNLOAD to take a backup of the table to S3 and then restore using COPY. It worked fine, but then we faced other issues: a. DATE fields with leading zeros didn't get extracted properly in the UNLOAD extract, e.g. the date '0001-01-01' was extracted as '1-01-01', which then fails during COPY as not a valid date. b. It throws several other errors during the restore (COPY), such as missing data for NOT NULL fields or invalid values for the INT data type. In other words, the UNLOAD and COPY commands together don't work in sync, and values change.
  4. Table restore from snapshot: I haven't tried this, but I understand AWS supports table-level restore now. Again, though, it's a tedious job to set this up individually for 500 tables, and you have to keep and track snapshots for a long time.
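
To make approaches 1 and 2 concrete, here is a minimal sketch of what we tried (the schema and table names are made up):

-- Approach 1: CTAS backup. The new table silently drops the IDENTITY,
-- default, and encoding attributes, and Redshift has no
-- ALTER TABLE ... ADD IDENTITY to put the property back afterwards.
CREATE TABLE bkp.xyz_bkp AS SELECT * FROM public.xyz;

-- Approach 2: recreate the table from its original DDL, then reload.
-- This fails on the sequence column, because INSERT cannot supply
-- explicit values for an IDENTITY column.
INSERT INTO public.xyz SELECT * FROM bkp.xyz_bkp;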

It would be very helpful if you could suggest the best possible way to backup and restore in my scenario, or the best practices organizations follow.

asked Feb 10 '18 by Genesis


People also ask

How many backup copies of Redshift does AWS maintain?

Automated backups are enabled by default when you create your Redshift cluster. Amazon will always attempt to maintain at least three copies of the data: the original and a replica on the compute nodes, plus a backup in Amazon S3 (Simple Storage Service).

Where do Redshift continuous backups get stored?

Data Durability: Amazon Redshift replicates your data within your data warehouse cluster and continuously backs up your data to Amazon S3, which is designed for eleven nines of durability. Amazon Redshift mirrors each drive's data to other nodes within your cluster.

Does Redshift have ETL?

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools.

How long are automatic Redshift backups retained by default?

The default retention period is one day, but you can modify it by using the Amazon Redshift console or programmatically by using the Amazon Redshift API or CLI. To disable automated snapshots, set the retention period to zero.
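
For example, a minimal AWS CLI sketch of changing the retention period (the cluster identifier is a placeholder):

# Extend automated snapshot retention to 7 days.
aws redshift modify-cluster \
    --cluster-identifier my-redshift-cluster \
    --automated-snapshot-retention-period 7

# Setting the period to 0 disables automated snapshots entirely.
aws redshift modify-cluster \
    --cluster-identifier my-redshift-cluster \
    --automated-snapshot-retention-period 0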


1 Answer

I would like to answer point by point, so this will be a bit long; please excuse me for that ;). In my opinion, the best option is UNLOAD to S3 and COPY back into the table from S3. (Here, S3 could also be replaced with EC2.)

  1. Traditional way: we prefer this when we need to do some data alteration and would like to dry-run our queries.
  2. Traditional backup with a different restore method: same issues as #1, so we don't use it.
  3. UNLOAD and COPY: this is the most convenient method, and even IDENTITY values can be retained, hence it is always the preferred method.

There are some problems listed in the question, but most of them are false or can be avoided by supplying the proper export/import parameters. I would like to provide all the necessary steps, with data, to prove my point that there are no issues with dates and timestamps during load and unload.

Here I'm covering most of the data types to prove my point.

create table sales(
salesid integer not null identity(1,1),
commission decimal(8,2),
saledate date,
description varchar(255),
created_at timestamp default sysdate,
updated_at timestamp);

Content of the flat file (sales-example.txt). Note it is pipe-delimited, which is COPY's default delimiter; the header row is skipped via IGNOREHEADER:

salesid,commission,saledate,description,created_at,updated_at
1|3.55|2018-12-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
2|6.55|2018-01-01|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
4|7.55|2018-02-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
5|3.55||Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
7|3.50|2018-10-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51

COPY command that will import the dates and timestamps, as well as the IDENTITY values (note the EXPLICIT_IDS option):

copy sales(salesid,commission,saledate,description,created_at,updated_at)
from 's3://****/de***/sales-example.txt'
credentials 'aws_access_key_id=************;aws_secret_access_key=***********'
IGNOREHEADER 1 EXPLICIT_IDS;

This will copy 5 records. I'm using PARALLEL OFF here to get the data into a single file to prove the point, though it's not required and should normally be avoided.

unload ('select salesid,commission,saledate,description,created_at,updated_at from sales')
to 's3://assortdw/development/sales-example-2.txt'
credentials 'aws_access_key_id=***********;aws_secret_access_key=***********'
parallel off;

And below is my content again, exactly the same as the import (the rows come back in a different order, since the unload query has no ORDER BY). Meaning: if I run the COPY command in any other environment, say dev or QA or somewhere else, I will get exactly the same records as in the Redshift cluster.

5|3.55||Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
1|3.55|2018-12-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
7|3.50|2018-10-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
2|6.55|2018-01-01|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
4|7.55|2018-02-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51

  4. Table restore from snapshot: This requires our networking/infrastructure group, hence we avoid it, though I'm less sure about this option. Other experts are most welcome to comment/share details about it. (A CLI sketch follows below.)
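
For completeness, a minimal AWS CLI sketch of a table-level snapshot restore (all identifiers are placeholders); scripting a loop over table names is how you would handle hundreds of tables:

# Restore a single table from an existing cluster snapshot into a new
# table name; repeat (or loop) per table.
aws redshift restore-table-from-cluster-snapshot \
    --cluster-identifier my-redshift-cluster \
    --snapshot-identifier my-snapshot-id \
    --source-database-name dev \
    --source-schema-name public \
    --source-table-name sales \
    --new-table-name sales_restored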

I hope this answers the question, as well as provides a starting point to discuss/summarize/conclude. All are most welcome to add their points.

answered Oct 16 '22 by Red Boy