Export data from Amazon Redshift as JSON

We are migrating from Redshift to Spark. I have a table in Redshift that I need to export to S3. From S3 this will be fed to Apache Spark (EMR).

As far as I can tell, there is only one way to export data from Redshift: the UNLOAD command. And UNLOAD cannot export typed data; it produces CSV, which is effectively a table of strings. Depending on the formatting options (quoting, delimiter, etc.), Spark doesn't seem to parse it reliably. So I am looking for a way to unload the data and make sure Spark reads it with the proper types.

Is there any way to unload the data as JSON or some other typed format that Spark can recognize?

Asked Oct 25 '16 by Shiplu Mokaddim

People also ask

How do I export AWS Redshift data?

The basic syntax to export your data is:

UNLOAD ('SELECT * FROM your_table')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
CSV;

The first line queries the data you want to export. Be aware that Redshift only allows a LIMIT clause in an inner SELECT statement.

Does AWS Redshift support JSON?

Amazon Redshift supports writing nested JSON data when your query result contains columns using SUPER, the native Amazon Redshift data type to store semi-structured data or documents as values. Support for exporting JSON data using UNLOAD is available in all AWS commercial Regions.
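
As a minimal sketch (the table name, bucket path, and role ARN are placeholders), the JSON export looks like this; note that the FORMAT JSON option postdates the original question below, so check that your cluster supports it:

UNLOAD ('SELECT * FROM event_logs')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
FORMAT JSON;

Each output line is a single JSON object, which Spark's JSON reader can ingest directly.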

Does Amazon use JSON?

Amazon DocumentDB is a natural and convenient choice for Habby as all of our game data is JSON and Amazon DocumentDB makes it easy to read and write that JSON data to the database.

How do I export data from Redshift to CSV?

A Redshift table is exported using either the UNLOAD command or a PostgreSQL client tool. UNLOAD is the fastest way to export a table, but it can only write to an S3 bucket. To export a Redshift table to a local CSV file, you have to go through a PostgreSQL client such as psql.
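
For example, a sketch using plain psql (the endpoint, credentials, and table name are placeholders); note that psql's unaligned mode produces comma-separated output but does not quote fields that themselves contain commas:

psql -h your-cluster.example.us-east-1.redshift.amazonaws.com -p 5439 -U your_user -d your_db \
    -A -F ',' -P footer=off -o your_table.csv \
    -c 'SELECT * FROM your_table'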


2 Answers

In the end I built the JSON manually with string concatenation:

-- UNLOAD AS JSON
UNLOAD ('SELECT CHR(123)||
\'"receiver_idfa":"\'||nvl(receiver_idfa,\'\')||\'",\'||
\'"brand":"\'||nvl(brand,\'\')||\'",\'||
\'"total":\'||nvl(total,0)||\',\'||
\'"screen_dpi":\'||nvl(screen_dpi,0)||\',\'||
\'"city":"\'||nvl(city,\'\')||\'",\'||
\'"wifi":\'||nvl(convert(integer,wifi),0)||\',\'||
\'"duration":\'||nvl(duration,0)||\',\'||
\'"carrier":"\'||nvl(carrier,\'\')||\'",\'||
\'"screen_width":\'||nvl(screen_width,0)||\',\'||
\'"time":\'||nvl("time",0)||\',\'||
\'"ts":"\'||nvl(ts,\'1970-01-01 00:00:00\')||\'",\'||
\'"month":\'||nvl(month,0)||\',\'||
\'"year":\'||nvl(year,0)||\',\'||
\'"day":\'||nvl(day,0)||\',\'||
\'"hour":\'||nvl(hour,0)||\',\'||
\'"minute":\'||nvl(minute,0)||
chr(125) from event_logs')
TO 's3://BUCKET/PREFIX/KEY'
WITH CREDENTIALS AS 'CREDENTIALS...' 
GZIP
DELIMITER AS '\t'
;

Here,

  1. nvl replaces NULLs with a default value
  2. convert casts booleans to int
  3. || is the concatenation operator in Redshift
  4. chr generates the { and } characters (codes 123 and 125); a small demo of these building blocks follows below
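
For instance, a throwaway query (runnable against any Redshift connection; the CAST on NULL is only there to give it a type) showing what each piece evaluates to:

SELECT nvl(CAST(NULL AS int), 0),  -- 0: NULL replaced with a default
       convert(integer, true),     -- 1: boolean cast to int
       'a' || 'b',                 -- 'ab': string concatenation
       chr(123) || chr(125);       -- '{}': the brace characters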

This operation is not as fast as unloading as CSV; it takes 2-3x longer. But as we only need to do it once, that's fine. I unloaded around 1.6 billion records and imported all of them into Spark successfully.

Note: parsing JSON in Spark is not the most efficient approach. Other formats, such as Parquet and sequence files, are faster, so for Spark this may not be the right path. But if you need to unload as JSON, you can use this solution.
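
On the Spark side, a minimal PySpark sketch of the import (the S3 path matches the UNLOAD above; the app name is made up, and the original answer doesn't show its actual code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-json-import").getOrCreate()

# Spark decompresses the .gz parts transparently; each line is one JSON
# object, so column types are inferred from the JSON values themselves.
df = spark.read.json("s3://BUCKET/PREFIX/KEY*")

df.printSchema()

# Optionally persist to a faster columnar format, per the note above.
df.write.mode("overwrite").parquet("s3://BUCKET/PREFIX/parquet/")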

Answered Oct 12 '22 by Shiplu Mokaddim


Check out the spark-redshift library, which is designed to allow Apache Spark to do bulk reads from Redshift using UNLOAD; it automatically manages the escaping and schema handling.

You can either run Spark queries directly against the data loaded from Redshift or you can save the Redshift data into a typed format like Parquet and then query that data.
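
For instance, a minimal PySpark sketch of reading a table through spark-redshift (the JDBC URL, credentials, and tempdir are placeholders; check the library's README for the exact options your version supports):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-read").getOrCreate()

# The library UNLOADs the table to the S3 tempdir behind the scenes and
# loads it as a typed DataFrame, handling escaping and the schema for you.
df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://HOST:5439/DB?user=USER&password=PASS")
      .option("dbtable", "event_logs")
      .option("tempdir", "s3n://BUCKET/tmp/")
      .load())

# Save to Parquet once, then run subsequent Spark queries against that copy.
df.write.parquet("s3n://BUCKET/event_logs_parquet/")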

Full disclosure: I'm the primary maintainer of that library.

Answered Oct 12 '22 by Josh Rosen