 

Copy data from Amazon S3 to Redshift and avoid duplicate rows

I am copying data from Amazon S3 to Redshift. During this process, I need to avoid loading the same files again. I don't have any unique constraints on my Redshift table. Is there a way to implement this using the COPY command?

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html

I tried adding a unique constraint and setting a column as the primary key, with no luck. Redshift does not seem to enforce unique or primary key constraints.

Rups N asked Mar 29 '13 10:03


People also ask

What is the most efficient and fastest way to load data into Redshift?

A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously.
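As an illustration only (the cluster, database, user, bucket, and role names below are placeholders, not anything taken from this question), a COPY pointed at an S3 prefix loads every file under that prefix in parallel, and it can be issued programmatically, for example through the boto3 Redshift Data API:

import boto3

# Hypothetical identifiers; replace with your own cluster, database, user, bucket and IAM role.
redshift_data = boto3.client("redshift-data")

response = redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "copy customer "
        "from 's3://mybucket/customer/' "  # prefix: COPY loads all files under it in parallel
        "iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';"
    ),
)
print(response["Id"])  # statement id; poll describe_statement() to check completion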


1 Answer

There's another way to truly avoid data duplication, although it's not as straightforward as removing duplicate data after it has been inserted. The COPY command has a manifest option that lets you specify exactly which files you want to copy:

copy customer
from 's3://mybucket/cust.manifest' 
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest; 
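For reference, the cust.manifest file itself is just a small JSON document listing the exact S3 objects to load (the object names below are made up for illustration):

{
  "entries": [
    {"url": "s3://mybucket/cust_part_01", "mandatory": true},
    {"url": "s3://mybucket/cust_part_02", "mandatory": true}
  ]
}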

you can build a lambda that generates a new manifest file every time before you run the copy command. That lambda will compare the files already copied with the new files arrived and will create a new manifest with only the new files so that you will never ingest the same file twice
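This is a minimal sketch, not the answerer's actual code; the bucket name, prefix, manifest key, and the simple JSON file used here to track already-loaded keys are all assumptions made for illustration:

import json
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "mybucket"             # bucket the data files arrive in (assumed)
SOURCE_PREFIX = "incoming/"            # prefix to scan for new files (assumed)
MANIFEST_KEY = "cust.manifest"         # manifest referenced by the COPY command
LOADED_KEY = "state/loaded_keys.json"  # JSON list of already-copied keys (assumed tracking scheme)


def lambda_handler(event, context):
    # 1. Read the list of keys that were already copied (empty on the first run).
    try:
        body = s3.get_object(Bucket=SOURCE_BUCKET, Key=LOADED_KEY)["Body"].read()
        already_loaded = set(json.loads(body))
    except s3.exceptions.NoSuchKey:
        already_loaded = set()

    # 2. List the files currently sitting under the incoming prefix.
    paginator = s3.get_paginator("list_objects_v2")
    current_keys = set()
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            current_keys.add(obj["Key"])

    # 3. Keep only the files that have never been loaded.
    new_keys = sorted(current_keys - already_loaded)

    # 4. Write a manifest that points exclusively at the new files.
    manifest = {
        "entries": [
            {"url": f"s3://{SOURCE_BUCKET}/{key}", "mandatory": True}
            for key in new_keys
        ]
    }
    s3.put_object(Bucket=SOURCE_BUCKET, Key=MANIFEST_KEY,
                  Body=json.dumps(manifest).encode("utf-8"))

    # 5. Remember everything that has now been handed off to COPY.
    s3.put_object(Bucket=SOURCE_BUCKET, Key=LOADED_KEY,
                  Body=json.dumps(sorted(already_loaded | set(new_keys))).encode("utf-8"))

    return {"new_files": len(new_keys)}

The COPY command above can then run right after this handler completes; because the manifest only ever lists unseen files, re-running the pipeline never loads the same file twice.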

Daniel answered Sep 18 '22 14:09