I am copying data from Amazon S3 to Redshift. During this process, I need to avoid loading the same files twice. I don't have any unique constraints on my Redshift table. Is there a way to implement this with the COPY command?
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
I tried adding a unique constraint and setting a column as the primary key, with no luck. Redshift accepts unique and primary key constraints, but they are informational only and are not enforced.
A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though that is much less efficient than COPY. The COPY command can read from multiple data files or multiple data streams in parallel. Note, however, that COPY by itself does not track which files have already been loaded.
There is another way to avoid duplication up front, rather than removing duplicated rows after they are inserted: the COPY command has a manifest option that lets you specify exactly which files to load.
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
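
For reference, the manifest is a plain JSON file listing the files to load; the bucket and key names below are placeholders:

{
  "entries": [
    {"url": "s3://mybucket/cust.part.00", "mandatory": true},
    {"url": "s3://mybucket/cust.part.01", "mandatory": true}
  ]
}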
You can build a Lambda function that regenerates the manifest file each time, before you run the COPY command. The Lambda compares the files already copied with the files that have newly arrived and writes a manifest containing only the new files, so the same file is never ingested twice. A sketch follows below.
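
Here is a minimal sketch of such a Lambda in Python with boto3. It assumes the record of already-copied files is itself kept as a JSON object in S3; all bucket names, prefixes, and keys are placeholders. In production you would update the record only after the COPY has succeeded.

import json
import boto3

# Placeholder names -- adjust to your environment.
BUCKET = "mybucket"
DATA_PREFIX = "incoming/"               # where new data files land
STATE_KEY = "state/loaded_keys.json"    # record of files already copied
MANIFEST_KEY = "manifests/cust.manifest"

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Load the set of keys that previous runs already ingested.
    try:
        body = s3.get_object(Bucket=BUCKET, Key=STATE_KEY)["Body"].read()
        loaded = set(json.loads(body))
    except s3.exceptions.NoSuchKey:
        loaded = set()

    # List every data file currently in the landing prefix.
    paginator = s3.get_paginator("list_objects_v2")
    current = set()
    for page in paginator.paginate(Bucket=BUCKET, Prefix=DATA_PREFIX):
        for obj in page.get("Contents", []):
            current.add(obj["Key"])

    new_keys = sorted(current - loaded)
    if not new_keys:
        return {"new_files": 0}

    # Write a manifest listing only the files not yet copied.
    manifest = {
        "entries": [
            {"url": f"s3://{BUCKET}/{key}", "mandatory": True}
            for key in new_keys
        ]
    }
    s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY,
                  Body=json.dumps(manifest).encode("utf-8"))

    # Record the listed files so the next run skips them.
    # (For stronger guarantees, update this only after COPY succeeds.)
    s3.put_object(Bucket=BUCKET, Key=STATE_KEY,
                  Body=json.dumps(sorted(current)).encode("utf-8"))

    return {"new_files": len(new_keys),
            "manifest": f"s3://{BUCKET}/{MANIFEST_KEY}"}

After this Lambda runs, point the COPY command shown above at the freshly written manifest; since the manifest never repeats a key that was listed on a previous run, each file is loaded at most once.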