
Loading Redshift from S3 (with partitions)

My S3 file (s3://MY_s3/stage/my_s3_file) has a partition on load_dt:

S3 Structure:

MY_S3/
  stage/
    my_s3_file/
      load_dt=2016-02-01/
        000024
      load_dt=2016-02-02/
        000036

The actual files are under load_dt partitions like 000024 & 000036.

The COPY command works fine if I don't include load_dt in the Redshift table, but when I add load_dt to the Redshift table the COPY command fails with a data error, because the input and target layouts don't match (the target has the extra load_dt column).

The Hive table that creates the S3 files shows the partition column load_dt at the end.

How can I make this COPY command work with load_dt present in the target Redshift table?

I even tried changing the S3 location to s3://MY_s3/stage/my_s3_file/load_dt, but no luck.
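For illustration, the COPY is of this general form (path as above; the IAM role and format options here are placeholders, not the actual values), and it succeeds as long as the Redshift table has no load_dt column:

COPY my_redshift_table
FROM 's3://MY_s3/stage/my_s3_file/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'  -- placeholder role
DELIMITER '\t' GZIP;  -- placeholder format options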

asked Apr 15 '16 by rkj

People also ask

Do we have partitions in Redshift?

Amazon Redshift does not support tablespaces, table partitioning, inheritance, and certain constraints.

Can S3 be partitioned?

You can specify partitioning and bucketing for storing data from CTAS query results in Amazon S3. For information about CTAS queries, see CREATE TABLE AS SELECT (CTAS). This section discusses partitioning and bucketing as they apply to CTAS queries only.

What is the most efficient and fastest way to load data into Redshift?

A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously.


4 Answers

When Hive (running under Apache Hadoop) creates a partitioned EXTERNAL TABLE, it separates files by directory. For example, all rows where load_dt=2016-02-01 are stored in the directory called load_dt=2016-02-01.

The files within the directory do not store a value for the partition column (load_dt). Instead, the value of the partition column is stored as part of the directory name. As such, renaming the directory will actually change the value in that column for all rows within the directory.

Yes, it's a bit strange but that's how Hive stores its data!

Amazon Redshift can import CSV files (including compressed CSV files) from Amazon S3. It can even import files from multiple sub-directories because it only looks at the path prefix of files to load. However, it does not understand the way that Hive stores partitioned data so it will not load the partition column from the directory name.

Some options:

  • Add another column in the output data that contains the same value (as you have now done)
  • Run a command after the COPY that sets the column value based on some calculation (eg from a date field); see the sketch after this list
  • Load one directory at a time (complex to achieve and not as efficient)
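A minimal sketch of the second option, assuming the loaded data already contains a date or timestamp column (event_ts, like the other table and column names here, is illustrative):

COPY my_redshift_table (col_a, col_b, event_ts)
FROM 's3://MY_s3/stage/my_s3_file/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';  -- placeholder role

-- derive the partition value from the existing timestamp column
UPDATE my_redshift_table
SET load_dt = TRUNC(event_ts)
WHERE load_dt IS NULL;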
answered Oct 12 '22 by John Rotenstein


I think I found the answer for my case.

I was not able to load the Hive partition column because Hive stores that partition value in its metadata; that's why the partition column was not present in the S3 files.

Now I am adding a new column, Load_Dt_New, to the S3 output through Hive, so the S3 files have the required column for my Redshift COPY command to work.
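A rough sketch of that idea in Hive (the column names, table name, and output path are placeholders; the partition column can be selected like any other column, so it can be written out as a regular field):

INSERT OVERWRITE DIRECTORY 's3://MY_s3/stage/my_s3_file_flat'
SELECT col_a, col_b, load_dt AS load_dt_new   -- placeholders for the real columns
FROM my_hive_table;                           -- placeholder Hive table name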

answered Oct 12 '22 by rkj


"copy table from 's3://MY_s3/stage/my_s3_file/load_dt' " should work if you can change the file name to "load_dt_20160201" rather than "load_dt=2016-02-01". COPY command will take the last part of S3 path as prefix.

answered Oct 12 '22 by Dabay Wang


Redshift's COPY command reports errors when the table schema and the Parquet file's columns don't match. So if your data is partitioned by a range (for example daily), you may be able to use a script like the one below.

# psql connection settings for the target Redshift cluster
export PGUSER=sample
export PGPASSWORD=sample
export PGHOST=xxxx.xxxx.redshift.amazonaws.com
export PGPORT=5439
export PGDATABASE=xxxx

start_date=`date -d '2018-01-01' +%s`
base_s3='s3://bucket_and_table_root_path/range_column_name='

# one COPY per daily partition, each into its own staging table
for day in {0..364}
do
  base_ymd=`date -d @$((${start_date} + 3600 * 24 * ${day})) +'%Y-%m-%d'`
  base_ymd_lower=`echo ${base_ymd} | sed '1,$s/-/_/g'`
  echo ${base_s3}${base_ymd}
  echo 'start-time' `date`
  psql <<EOF
DROP TABLE IF EXISTS loading_table_${base_ymd_lower}_temp;
CREATE TABLE loading_table_${base_ymd_lower}_temp (
    col_a VARCHAR(xxxx) ENCODE LZO,
    ...
)
DISTSTYLE even
;
COPY loading_table_${base_ymd_lower}_temp
FROM '${base_s3}${base_ymd}'
iam_role 'arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx'
format as parquet
;
EOF

done

Next, you can combine the per-day staging tables with CTAS and the UNION keyword in Redshift, as sketched below.
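A minimal sketch of that final step, assuming two of the daily staging tables created by the script above (the range column name is a placeholder):

CREATE TABLE loading_table AS
SELECT DATE '2018-01-01' AS range_column_name, * FROM loading_table_2018_01_01_temp
UNION ALL
SELECT DATE '2018-01-02', * FROM loading_table_2018_01_02_temp
-- ... one SELECT per daily staging table
;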

answered Oct 12 '22 by JoyPlayer