
Automate redshift truncate/delete data after a retention period

I have a Redshift table that stores a lot of data. Every weekend I manually use Workbench to TRUNCATE the previous week of data that I no longer need. I have to run:

DELETE FROM tableName WHERE created_date BETWEEN timeStamp1 AND timeStamp2;

Is there a way to set an expiration policy on the table so that the data is removed every Sunday for me?
If not, is there a way to automate the delete every 7 days, such as a shell script or a cron job in Node.js?

asked Sep 24 '17 07:09 by Piqué



1 Answer

No, there is no in-built ability to run commands on a regular basis on Amazon Redshift. You could, however, run a script on another system that connects to Redshift and runs the command.

For example, a cron job could call psql to connect to Redshift and execute the command. This could be done in a one-line script.
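As a rough sketch of the cron approach (not a tested production script), the wrapper below builds a psql command line and only executes it when `--run` is passed. The host, database, user, file path and the 7-day cutoff are all placeholder assumptions, and the password is assumed to come from `~/.pgpass` or the `PGPASSWORD` environment variable. Note that it deletes everything older than the cutoff rather than a BETWEEN range.

```python
import subprocess
import sys

# Placeholder connection details (assumptions, not values from the question).
HOST = "example-cluster.abc123.us-east-1.redshift.amazonaws.com"
PORT = "5439"  # default Redshift port
DATABASE = "mydb"
USER = "purge_user"

# Remove rows older than 7 days; DATEADD and GETDATE are Redshift SQL functions.
PURGE_SQL = "DELETE FROM tableName WHERE created_date < DATEADD(day, -7, GETDATE());"


def build_psql_command(sql=PURGE_SQL):
    """Return the argument list for a psql call that runs a single statement."""
    return ["psql", "-h", HOST, "-p", PORT, "-U", USER, "-d", DATABASE, "-c", sql]


def main(argv):
    command = build_psql_command()
    if "--run" not in argv:
        # Dry run: show what would be executed without touching the cluster.
        print(" ".join(command))
        return 0
    # check=True raises CalledProcessError if psql exits with a non-zero status.
    subprocess.run(command, check=True)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

A crontab entry such as `0 3 * * 0 /usr/bin/python3 /path/to/purge.py --run` (the path is hypothetical) would run it every Sunday at 03:00.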

Alternatively, you could configure an AWS Lambda function to connect to Redshift and execute the command. (You would need to write the function yourself, but there are libraries that make this easier.) Then, you would configure Amazon CloudWatch Events to trigger the Lambda function on the desired schedule (e.g. once a week).
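One possible shape for such a function, assuming the Redshift Data API (the boto3 `redshift-data` client) is available to you; it is only one of several ways to connect. The cluster identifier, database, user and the 7-day retention period are made-up placeholders, and the `client` argument exists only so the handler can be exercised without AWS credentials.

```python
import datetime as dt

RETENTION_DAYS = 7  # assumption: keep one week of data


def build_delete(table, now, retention_days=RETENTION_DAYS):
    """Return the DELETE statement and its named parameter for the cutoff."""
    cutoff = (now - dt.timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")
    sql = f"DELETE FROM {table} WHERE created_date < :cutoff"
    return sql, [{"name": "cutoff", "value": cutoff}]


def lambda_handler(event, context, client=None):
    if client is None:
        import boto3  # provided by the Lambda Python runtime

        client = boto3.client("redshift-data")
    sql, params = build_delete("tableName", dt.datetime.now(dt.timezone.utc))
    response = client.execute_statement(
        ClusterIdentifier="example-cluster",  # placeholder
        Database="mydb",  # placeholder
        DbUser="purge_user",  # placeholder
        Sql=sql,
        Parameters=params,
    )
    # The Data API runs statements asynchronously; the Id can be used to poll status.
    return response["Id"]
```

The schedule itself would live in the CloudWatch Events rule, not in the function, so the handler only needs to work out the cutoff at the moment it is invoked.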

A common strategy is to store data in separate tables per time period (e.g. a month, but in your case it would be a week). Then, define a view that combines several tables. To delete a week of data, simply drop the table that contains that week of data, create a new table for this week's data, then update the view to point to the new table but not the old table.
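To illustrate the weekly rotation, here is a small sketch that only generates the SQL statements; running them is left to whatever scheduler you choose. The base name `events`, the `events_template` table used with LIKE, the date-suffixed naming scheme and the number of weeks kept are all assumptions, and it presumes the earlier weekly tables already exist. The statements are ordered so the view is repointed before the old table is dropped, because a regular view that still references a table can block dropping it.

```python
import datetime as dt


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - dt.timedelta(days=day.weekday())


def table_name(base, monday):
    """Name a weekly table after the Monday it starts on, e.g. events_20170918."""
    return f"{base}_{monday:%Y%m%d}"


def rotation_statements(base, today, keep_weeks=4):
    """Return SQL that adds this week's table, repoints the view, drops the oldest."""
    current = week_start(today)
    kept = [table_name(base, current - dt.timedelta(weeks=i)) for i in range(keep_weeks)]
    expired = table_name(base, current - dt.timedelta(weeks=keep_weeks))
    union = "\n  UNION ALL\n".join(f"  SELECT * FROM {name}" for name in kept)
    return [
        f"CREATE TABLE IF NOT EXISTS {kept[0]} (LIKE {base}_template);",
        f"CREATE OR REPLACE VIEW {base} AS\n{union};",
        f"DROP TABLE IF EXISTS {expired};",
    ]


for statement in rotation_statements("events", dt.date(2017, 9, 24), keep_weeks=2):
    print(statement)
```

Dropping a whole table avoids the DELETE plus VACUUM cycle for the expired week.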

By the way...

Your example uses the DELETE command, which is not the same as the TRUNCATE command.

TRUNCATE removes all data from a table. It is an efficient way to completely empty a table.

DELETE is good for removing part of a table but it simply marks rows as deleted. The data still occupies space on disk. Therefore, it is recommended that you VACUUM the table after deleting a significant quantity of data.
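The difference can be summed up in a tiny helper that picks the statements to send. This is only an illustration: the `created_date` column comes from the question, while the function name and the retention argument are made up. As far as I know, VACUUM cannot run inside a transaction block, so these statements are meant to be sent one at a time with autocommit on.

```python
def purge_statements(table, retention_days=None):
    """Return the SQL to empty a table, or to trim rows older than a cutoff."""
    if retention_days is None:
        # Removing everything: TRUNCATE needs no VACUUM afterwards.
        return [f"TRUNCATE {table};"]
    days = int(retention_days)
    return [
        # DELETE only marks rows as deleted...
        f"DELETE FROM {table} WHERE created_date < DATEADD(day, -{days}, GETDATE());",
        # ...so reclaim the disk space afterwards.
        f"VACUUM DELETE ONLY {table};",
    ]


print(purge_statements("tableName", retention_days=7))
```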

answered Sep 17 '22 16:09 by John Rotenstein