We are migrating from Redshift to Spark. I have a table in Redshift that I need to export to S3. From S3 this will be fed to Apache Spark (EMR).
I found there is only one way to export data from Redshift, and that is the UNLOAD command. And UNLOAD cannot export typed data: it produces CSV, which is just a table of strings. Depending on the formatting options (quote character, delimiter, etc.), Spark doesn't seem to recognize it well. So I am looking for a way to unload the data and make sure Spark reads it with the proper types.
Is there any way to unload data as JSON or other typed format that is recognizable to Spark?
The basic syntax to export your data is as below:

UNLOAD ('SELECT * FROM your_table')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
CSV;

On the first line, you query the data you want to export. Be aware that Redshift only allows a LIMIT clause in an inner SELECT statement.
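For example, if you only want to export a sample of the rows, the LIMIT has to go into a nested subquery; a sketch using the same placeholders as above:

UNLOAD ('SELECT * FROM (SELECT * FROM your_table LIMIT 10) AS t')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
CSV;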
Amazon Redshift also supports writing nested JSON data when your query result contains columns using SUPER, the native Amazon Redshift data type for storing semi-structured data or documents as values. Support for exporting JSON data using UNLOAD is available in all AWS commercial Regions.
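If your Redshift release supports the JSON format option for UNLOAD, the export above can be written with FORMAT AS JSON instead of CSV. A sketch, with the same placeholders as above (check the UNLOAD documentation for the format and compression options available on your cluster):

UNLOAD ('SELECT * FROM your_table')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
FORMAT AS JSON;

Each output file then contains one JSON object per line, which Spark's JSON reader understands directly.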
A Redshift table can be exported with the UNLOAD command or through a PostgreSQL client. UNLOAD is the fastest way to export a Redshift table, but it can only write to an S3 bucket; to export a Redshift table to a local CSV file you have to go through PostgreSQL tooling such as psql.
In the end I built the JSON manually with string concatenation:
-- UNLOAD AS JSON
UNLOAD ('SELECT CHR(123)||
\'"receiver_idfa":"\'||nvl(receiver_idfa,\'\')||\'",\'||
\'"brand":"\'||nvl(brand,\'\')||\'",\'||
\'"total":\'||nvl(total,0)||\',\'||
\'"screen_dpi":\'||nvl(screen_dpi,0)||\',\'||
\'"city":"\'||nvl(city,\'\')||\'",\'||
\'"wifi":\'||nvl(convert(integer,wifi),0)||\',\'||
\'"duration":\'||nvl(duration,0)||\',\'||
\'"carrier":"\'||nvl(carrier,\'\')||\'",\'||
\'"screen_width":\'||nvl(screen_width,0)||\',\'||
\'"time":\'||nvl("time",0)||\',\'||
\'"ts":"\'||nvl(ts,\'1970-01-01 00:00:00\')||\'",\'||
\'"month":\'||nvl(month,0)||\',\'||
\'"year":\'||nvl(year,0)||\',\'||
\'"day":\'||nvl(day,0)||\',\'||
\'"hour":\'||nvl(hour,0)||\',\'||
\'"minute":\'||nvl(minute,0)||
chr(125) from event_logs')
TO 's3://BUCKET/PREFIX/KEY'
WITH CREDENTIALS AS 'CREDENTIALS...'
GZIP
DELIMITER AS '\t'
;
Here,

- nvl is used for replacing nulls
- convert is used for converting booleans to int
- || is the concatenation operator in Redshift
- chr is used to generate the { and } characters

This operation is not as fast as just unloading as CSV; it takes about 2-3x longer. But since we only need to do it once, that's fine. I unloaded around 1,600 million records and imported all of them into Spark successfully.
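Once the gzipped JSON files are in S3, Spark can infer a typed schema from the values on its own. A minimal Spark SQL sketch, assuming the files were unloaded to s3://BUCKET/PREFIX/ as above (on older Spark versions the statement is CREATE TEMPORARY TABLE rather than CREATE TEMPORARY VIEW):

-- Register the unloaded files as a view; Spark decompresses the .gz
-- parts transparently and infers column types from the JSON values.
CREATE TEMPORARY VIEW event_logs
USING json
OPTIONS (path 's3://BUCKET/PREFIX/');

-- The inferred numeric types allow aggregation right away.
SELECT brand, sum(total) AS total_sum
FROM event_logs
GROUP BY brand;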
Note: parsing JSON is not the most efficient way for Spark to read data; other formats, such as Parquet or sequence files, are faster. So for Spark this might not be the right long-term path, but if you need to unload as JSON you can use this solution.
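Newer Redshift releases can also unload directly to Parquet, which keeps the column types in a columnar format Spark reads natively and avoids the JSON step entirely. A sketch, assuming the FORMAT AS PARQUET option is available on your cluster (same role placeholder as above):

UNLOAD ('SELECT * FROM event_logs')
TO 's3://BUCKET/PREFIX/'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
FORMAT AS PARQUET;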
Check out the spark-redshift library, which is designed to allow Apache Spark to do bulk reads from Redshift using UNLOAD; it automatically manages the escaping and schema handling.
You can either run Spark queries directly against the data loaded from Redshift or you can save the Redshift data into a typed format like Parquet and then query that data.
Full disclosure: I'm the primary maintainer of that library.
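For reference, a minimal sketch of the library's SQL interface as described in its README (the JDBC URL, table name, and temp bucket below are placeholders; check the README for the exact options supported by your version):

-- Expose a Redshift table to Spark SQL; behind the scenes the library
-- UNLOADs to the temp directory and carries over the table's schema.
CREATE TABLE redshift_event_logs
USING com.databricks.spark.redshift
OPTIONS (
  dbtable 'event_logs',
  tempdir 's3n://BUCKET/tmp/',
  url 'jdbc:redshift://redshifthost:5439/database?user=username&password=password'
);

SELECT count(*) FROM redshift_event_logs;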