My S3 file (s3://MY_s3/stage/my_s3_file) has a partition on load_dt:
S3 Structure:
-MY_S3
--stage
---my_s3_file
----load_dt=2016-02-01
-----000024
----load_dt=2016-02-02
-----000036
The actual data files (000024, 000036, ...) sit under the load_dt=... partition directories.
The COPY command works fine as long as I leave load_dt out of the Redshift table, but once I add load_dt to the Redshift table the COPY fails with a data error because the input layout and the target layout no longer match (the target has the extra load_dt column).
The Hive table that produces these S3 files lists the partition column load_dt last.
How can I make this COPY command work with load_dt present in the target Redshift table?
I even tried changing the S3 location to s3://MY_s3/stage/my_s3_file/load_dt, but no luck.
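For reference, this is roughly the setup (every column other than load_dt is a placeholder, and the format options are omitted):

-- Target table in Redshift; load_dt is the extra column the S3 files do not contain
CREATE TABLE my_table (
    col_a   VARCHAR(100),
    col_b   VARCHAR(100),
    load_dt DATE
);

-- Works when my_table has no load_dt column, fails with a data error once it does
COPY my_table
FROM 's3://MY_s3/stage/my_s3_file'
IAM_ROLE 'arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx';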
Amazon Redshift does not support tablespaces, table partitioning, inheritance, and certain constraints.
You can specify partitioning and bucketing for storing data from CTAS query results in Amazon S3 (see CREATE TABLE AS SELECT (CTAS)); partitioning and bucketing apply to CTAS queries only.
A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though doing so is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously.
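For example (table, columns and role below are illustrative only):

-- Row-by-row INSERTs work, but are far slower than a bulk load
INSERT INTO my_table (col_a, col_b) VALUES ('a', 'b');

-- A single COPY pointed at a prefix reads every file under it in parallel
COPY my_table
FROM 's3://MY_s3/stage/my_s3_file'
IAM_ROLE 'arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx';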
When Hive (running under Apache Hadoop) creates a partitioned EXTERNAL TABLE, it separates files by directory. For example, all rows where load_dt=2016-02-01 are stored in the directory called load_dt=2016-02-01.
The files within the directory do not store a value for the partition column (load_dt). Instead, the value of the partition column is stored as part of the directory name. As such, renaming the directory will actually change the value in that column for all rows within the directory.
Yes, it's a bit strange but that's how Hive stores its data!
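As a sketch (column names and types are assumptions), a partitioned Hive table declared like this:

CREATE EXTERNAL TABLE my_s3_file (
    col_a STRING,
    col_b STRING
)
PARTITIONED BY (load_dt STRING)
LOCATION 's3://MY_s3/stage/my_s3_file/';

writes each partition's rows into its own load_dt=... directory, and the data files themselves contain only col_a and col_b.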
Amazon Redshift can import CSV files (including compressed CSV files) from Amazon S3. It can even import files from multiple sub-directories because it only looks at the path prefix of files to load. However, it does not understand the way that Hive stores partitioned data so it will not load the partition column from the directory name.
Some options:
- Run a separate COPY for each load_dt=... directory and then UPDATE the load_dt column with that directory's value (a rough sketch follows this list)
- Run an UPDATE after the COPY that sets the column value based on some calculation (eg from a date field)
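A rough sketch of the per-directory approach (column names, role and format options are assumptions):

-- Load one partition directory at a time; the files contain no load_dt value
COPY my_table (col_a, col_b)
FROM 's3://MY_s3/stage/my_s3_file/load_dt=2016-02-01'
IAM_ROLE 'arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx';

-- Then stamp the partition value onto the rows that were just loaded
UPDATE my_table SET load_dt = '2016-02-01' WHERE load_dt IS NULL;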
I think I found the answer for my case.
I was not able to load the Hive partition column because Hive stores that partition value in its metadata (the directory name), which is why the column is not present in the S3 files.
Now I am adding a new column, Load_Dt_New, to the S3 output through Hive, so the S3 files contain the column my Redshift COPY command needs.
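A minimal sketch of that Hive step (table and column names are assumptions): write the data into a new unpartitioned external table so the partition value becomes an ordinary column inside the files.

CREATE EXTERNAL TABLE my_s3_file_new (
    col_a STRING,
    col_b STRING,
    load_dt_new STRING
)
LOCATION 's3://MY_s3/stage/my_s3_file_new/';

INSERT OVERWRITE TABLE my_s3_file_new
SELECT col_a, col_b, load_dt
FROM my_s3_file;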
"copy table from 's3://MY_s3/stage/my_s3_file/load_dt' " should work if you can change the file name to "load_dt_20160201" rather than "load_dt=2016-02-01". COPY command will take the last part of S3 path as prefix.
Redshift's COPY command reports errors when the table schema and the Parquet columns do not match. If your data uses a (daily) range partition, a script like the one below may work.
#!/bin/bash
# Connection settings picked up by psql via its standard PG* environment variables
export PGUSER=sample
export PGPASSWORD=sample
export PGHOST=xxxx.xxxx.redshift.amazonaws.com
export PGPORT=5439
export PGDATABASE=xxxx

# Loop over one year of daily partitions, starting at 2018-01-01
start_date=`date -d '2018-01-01' +%s`
base_s3='s3://bucket_and_table_root_path/range_column_name='
for day in {0..364}
do
  # Partition value for this day (e.g. 2018-01-05) and an underscore form usable in table names
  base_ymd=`date -d @$((${start_date} + 3600 * 24 * ${day})) +'%Y-%m-%d'`
  base_ymd_lower=`echo ${base_ymd} | sed '1,$s/-/_/g'`
  echo ${base_s3}${base_ymd}
  echo 'start-time' `date`
  # Create a per-day staging table and COPY only that partition's Parquet files into it
  psql <<EOF
DROP TABLE IF EXISTS loading_table_${base_ymd_lower}_temp;
CREATE TABLE loading_table_${base_ymd_lower}_temp (
col_a VARCHAR(xxxx) ENCODE LZO,
...  -- remaining columns, matching the Parquet schema exactly
)
DISTSTYLE even
;
COPY loading_table_${base_ymd_lower}_temp
FROM '${base_s3}${base_ymd}'
iam_role 'arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx'
format as parquet
;
EOF
done
Then you can combine the daily staging tables with CTAS and the UNION keyword in Redshift.
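A sketch of that final step, assuming the daily staging tables created by the script above:

CREATE TABLE final_table AS
SELECT * FROM loading_table_2018_01_01_temp
UNION ALL
SELECT * FROM loading_table_2018_01_02_temp
-- ... one SELECT per daily staging table
;

UNION ALL avoids the duplicate-elimination pass that a plain UNION would add.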