
Loading Redshift from S3 (with partitions)

My S3 file (s3://MY_s3/stage/my_s3_file) has a partition on load_dt:

S3 Structure:

MY_S3/
  stage/
    my_s3_file/
      load_dt=2016-02-01/
        000024
      load_dt=2016-02-02/
        000036

The actual files are under load_dt partitions like 000024 & 000036.

The COPY command works fine if I don't include load_dt in the Redshift table, but when I add load_dt to the Redshift table the COPY command fails with a data error, because the input and target layouts don't match (the target has the extra load_dt column).

The Hive table that creates the S3 files shows the partition column load_dt at the end.

How can I make this COPY command work with load_dt present in the target Redshift table?

I even tried changing the S3 location to s3://MY_s3/stage/my_s3_file/load_dt, but no luck.
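For illustration, the COPY is of this general form (path as above; the IAM role and format options here are placeholders, not the actual values), and it succeeds as long as the Redshift table has no load_dt column:

COPY my_redshift_table
FROM 's3://MY_s3/stage/my_s3_file/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'  -- placeholder role
DELIMITER '\t' GZIP;  -- placeholder format options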

asked Apr 15 '16 by rkj

People also ask

Do we have partitions in Redshift?

Amazon Redshift does not support tablespaces, table partitioning, inheritance, and certain constraints.

Can S3 be partitioned?

You can specify partitioning and bucketing for storing data from CTAS query results in Amazon S3. For information about CTAS queries, see CREATE TABLE AS SELECT (CTAS). This section discusses partitioning and bucketing as they apply to CTAS queries only.

What is the most efficient and fastest way to load data into Redshift?

A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously.


4 Answers

When Hive (running under Apache Hadoop) creates a partitioned EXTERNAL TABLE, it separates files by directory. For example, all rows where load_dt=2016-02-01 are stored in the directory called load_dt=2016-02-01.

The files within the directory do not store a value for the partition column (load_dt). Instead, the value of the partition column is stored as part of the directory name. As such, renaming the directory will actually change the value in that column for all rows within the directory.

Yes, it's a bit strange but that's how Hive stores its data!

Amazon Redshift can import CSV files (including compressed CSV files) from Amazon S3. It can even import files from multiple sub-directories because it only looks at the path prefix of files to load. However, it does not understand the way that Hive stores partitioned data so it will not load the partition column from the directory name.

Some options:

  • Add another column in the output data that contains the same value (as you have now done)
  • Run a command after the COPY that sets the column value based on some calculation (eg from a date field); see the sketch after this list
  • Load one directory at a time (complex to achieve and not as efficient)
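A minimal sketch of the second option, assuming the loaded data already contains a date or timestamp column (event_ts, like the other table and column names here, is illustrative):

COPY my_redshift_table (col_a, col_b, event_ts)
FROM 's3://MY_s3/stage/my_s3_file/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';  -- placeholder role

-- derive the partition value from the existing timestamp column
UPDATE my_redshift_table
SET load_dt = TRUNC(event_ts)
WHERE load_dt IS NULL;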
answered Oct 12 '22 by John Rotenstein


I think I found the answer for my case.

I was not able to load the Hive partition column because Hive stores that partition value in its metadata; that's why the partition column was not present in the S3 files.

Now I am adding a new column, Load_Dt_New, to the S3 output through Hive, so the S3 files have the required column for my Redshift COPY command to work.
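A rough sketch of that idea in Hive (the column names, table name, and output path are placeholders; the partition column can be selected like any other column, so it can be written out as a regular field):

INSERT OVERWRITE DIRECTORY 's3://MY_s3/stage/my_s3_file_flat'
SELECT col_a, col_b, load_dt AS load_dt_new   -- placeholders for the real columns
FROM my_hive_table;                           -- placeholder Hive table name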

answered Oct 12 '22 by rkj


"copy table from 's3://MY_s3/stage/my_s3_file/load_dt' " should work if you can change the file name to "load_dt_20160201" rather than "load_dt=2016-02-01". COPY command will take the last part of S3 path as prefix.

answered Oct 12 '22 by Dabay Wang


Redshift's COPY command reports errors when the table schema and the Parquet file's columns don't match. So if your data is partitioned by a range (for example daily), you may be able to use a script like the one below.

# psql connection settings for the target Redshift cluster
export PGUSER=sample
export PGPASSWORD=sample
export PGHOST=xxxx.xxxx.redshift.amazonaws.com
export PGPORT=5439
export PGDATABASE=xxxx

start_date=`date -d '2018-01-01' +%s`
base_s3='s3://bucket_and_table_root_path/range_column_name='

# one COPY per daily partition, each into its own staging table
for day in {0..364}
do
  base_ymd=`date -d @$((${start_date} + 3600 * 24 * ${day})) +'%Y-%m-%d'`
  base_ymd_lower=`echo ${base_ymd} | sed '1,$s/-/_/g'`
  echo ${base_s3}${base_ymd}
  echo 'start-time' `date`
  psql <<EOF
DROP TABLE IF EXISTS loading_table_${base_ymd_lower}_temp;
CREATE TABLE loading_table_${base_ymd_lower}_temp (
    col_a VARCHAR(xxxx) ENCODE LZO,
    ...
)
DISTSTYLE even
;
COPY loading_table_${base_ymd_lower}_temp
FROM '${base_s3}${base_ymd}'
iam_role 'arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx'
format as parquet
;
EOF

done

Next, you can combine the per-day staging tables with CTAS and the UNION keyword in Redshift, as sketched below.
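A minimal sketch of that final step, assuming two of the daily staging tables created by the script above (the range column name is a placeholder):

CREATE TABLE loading_table AS
SELECT DATE '2018-01-01' AS range_column_name, * FROM loading_table_2018_01_01_temp
UNION ALL
SELECT DATE '2018-01-02', * FROM loading_table_2018_01_02_temp
-- ... one SELECT per daily staging table
;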

answered Oct 12 '22 by JoyPlayer