In Amazon Redshift, data is usually loaded from S3 with the COPY command. I want to know whether the command is atomic. For example, is it possible in some exceptional case that only part of a data file is loaded into the Redshift table?
The COPY command with default options is atomic. If the file contains an invalid line that causes a load failure, the COPY transaction is rolled back and no data is imported.
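As a rough sketch of what this means in practice, COPY can also be placed inside an explicit transaction together with other statements, so that a failed load leaves the table as it was. The table name, S3 path and credentials below are the same placeholders used elsewhere in this answer, and the DELETE is only an illustrative assumption about how you might replace a table's contents:

```sql
-- Sketch: replace the contents of a table in one transaction.
-- table_name, the S3 path and the credentials are placeholders.
BEGIN;

-- DELETE is used here rather than TRUNCATE because TRUNCATE
-- commits the current transaction in Redshift.
DELETE FROM table_name;

COPY table_name
FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx'
DELIMITER '\t';

-- If the COPY fails, the transaction is aborted and the DELETE
-- is not committed, so the old rows remain.
COMMIT;
```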
If you want to skip invalid lines instead of aborting the transaction, you can use the MAXERROR option of the COPY command, which tolerates up to a given number of invalid lines. Here is an example that ignores up to 100 invalid lines.
COPY table_name FROM 's3://[bucket-name]/[file-path or prefix]' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' DELIMITER '\t' MAXERROR 100;
If the number of invalid lines exceeds the MAXERROR count (100 here), the transaction is rolled back.
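To see which lines were rejected during a load, you can query the STL_LOAD_ERRORS system table. This is a minimal sketch; the choice of columns and the LIMIT of 10 are just illustrative:

```sql
-- Show the most recent load errors recorded by COPY.
SELECT starttime,
       filename,
       line_number,
       colname,
       err_code,
       err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
```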
See the documentation for the details of the COPY command: http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
You can use the NOLOAD flag to check for errors before loading the data. This is a faster way to validate the format of your data, because it only parses the files and does not load anything.
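A validation run with NOLOAD might look like the following sketch. The table name, S3 path, credentials and delimiter are placeholders and should match whatever you plan to use for the real load:

```sql
-- Dry run: parse and validate the files without loading any rows.
COPY table_name
FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx'
DELIMITER '\t'
NOLOAD;
```

If this reports no errors, run the same command again without NOLOAD to perform the actual load.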
You can define how many errors you are willing to tolerate with the MAXERROR flag.
If the load produces more errors than the MAXERROR count, it fails and no records are added.
See more information here: http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html