We are migrating from Redshift to Spark. I have a table in Redshift that I need to export to S3. From S3 this will be fed to Apache Spark (EMR).
I found there is only one way to export data from Redshift, and that is the UNLOAD command. And UNLOAD cannot export typed data: it produces CSV, which is just a table of strings. Depending on the formatting options (quote character, delimiter, etc.), Spark doesn't seem to recognize it well. So I am looking for a way to unload the data and make sure Spark reads it with the proper types.
Is there any way to unload data as JSON or other typed format that is recognizable to Spark?
The basic syntax to export your data is as below:

UNLOAD ('SELECT * FROM your_table')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
CSV;

On the first line, you query the data you want to export. Be aware that Redshift only allows a LIMIT clause in an inner SELECT statement.
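For example, if you only want to export a sample of the rows, the LIMIT has to go into a nested subquery; a sketch using the same placeholders as above:

UNLOAD ('SELECT * FROM (SELECT * FROM your_table LIMIT 10) AS t')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
CSV;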
Amazon Redshift also supports writing nested JSON data when your query result contains columns using SUPER, the native Amazon Redshift data type for storing semi-structured data or documents as values. Support for exporting JSON data using UNLOAD is available in all AWS commercial Regions.
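If your Redshift release supports the JSON format option for UNLOAD, the export above can be written with FORMAT AS JSON instead of CSV. A sketch, with the same placeholders as above (check the UNLOAD documentation for the format and compression options available on your cluster):

UNLOAD ('SELECT * FROM your_table')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
FORMAT AS JSON;

Each output file then contains one JSON object per line, which Spark's JSON reader understands directly.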
A Redshift table can be exported with the UNLOAD command or through a PostgreSQL client. UNLOAD is the fastest way to export a Redshift table, but it can only write to an S3 bucket; to export a Redshift table to a local CSV file you have to go through PostgreSQL tooling such as psql.
In the end I built the JSON manually with string concatenation:
-- UNLOAD AS JSON
UNLOAD ('SELECT CHR(123)||
\'"receiver_idfa":"\'||nvl(receiver_idfa,\'\')||\'",\'||
\'"brand":"\'||nvl(brand,\'\')||\'",\'||
\'"total":\'||nvl(total,0)||\',\'||
\'"screen_dpi":\'||nvl(screen_dpi,0)||\',\'||
\'"city":"\'||nvl(city,\'\')||\'",\'||
\'"wifi":\'||nvl(convert(integer,wifi),0)||\',\'||
\'"duration":\'||nvl(duration,0)||\',\'||
\'"carrier":"\'||nvl(carrier,\'\')||\'",\'||
\'"screen_width":\'||nvl(screen_width,0)||\',\'||
\'"time":\'||nvl("time",0)||\',\'||
\'"ts":"\'||nvl(ts,\'1970-01-01 00:00:00\')||\'",\'||
\'"month":\'||nvl(month,0)||\',\'||
\'"year":\'||nvl(year,0)||\',\'||
\'"day":\'||nvl(day,0)||\',\'||
\'"hour":\'||nvl(hour,0)||\',\'||
\'"minute":\'||nvl(minute,0)||
chr(125) from event_logs')
TO 's3://BUCKET/PREFIX/KEY'
WITH CREDENTIALS AS 'CREDENTIALS...'
GZIP
DELIMITER AS '\t'
;
Here,

- nvl is used for replacing nulls
- convert is used for converting booleans to int
- || is the concatenation operator in Redshift
- chr is used to generate the { and } characters

This operation is not as fast as just unloading as CSV; it takes about 2-3x longer. But since we only need to do it once, that's fine. I unloaded around 1,600 million records and imported all of them into Spark successfully.
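Once the gzipped JSON files are in S3, Spark can infer a typed schema from the values on its own. A minimal Spark SQL sketch, assuming the files were unloaded to s3://BUCKET/PREFIX/ as above (on older Spark versions the statement is CREATE TEMPORARY TABLE rather than CREATE TEMPORARY VIEW):

-- Register the unloaded files as a view; Spark decompresses the .gz
-- parts transparently and infers column types from the JSON values.
CREATE TEMPORARY VIEW event_logs
USING json
OPTIONS (path 's3://BUCKET/PREFIX/');

-- The inferred numeric types allow aggregation right away.
SELECT brand, sum(total) AS total_sum
FROM event_logs
GROUP BY brand;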
Note: parsing JSON is not the most efficient way for Spark to read data; other formats, such as Parquet or sequence files, are faster. So for Spark this might not be the right long-term path, but if you need to unload as JSON you can use this solution.
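Newer Redshift releases can also unload directly to Parquet, which keeps the column types in a columnar format Spark reads natively and avoids the JSON step entirely. A sketch, assuming the FORMAT AS PARQUET option is available on your cluster (same role placeholder as above):

UNLOAD ('SELECT * FROM event_logs')
TO 's3://BUCKET/PREFIX/'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
FORMAT AS PARQUET;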
Check out the spark-redshift library, which is designed to allow Apache Spark to do bulk reads from Redshift using UNLOAD; it automatically manages the escaping and schema handling.
You can either run Spark queries directly against the data loaded from Redshift or you can save the Redshift data into a typed format like Parquet and then query that data.
Full disclosure: I'm the primary maintainer of that library.
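For reference, a minimal sketch of the library's SQL interface as described in its README (the JDBC URL, table name, and temp bucket below are placeholders; check the README for the exact options supported by your version):

-- Expose a Redshift table to Spark SQL; behind the scenes the library
-- UNLOADs to the temp directory and carries over the table's schema.
CREATE TABLE redshift_event_logs
USING com.databricks.spark.redshift
OPTIONS (
  dbtable 'event_logs',
  tempdir 's3n://BUCKET/tmp/',
  url 'jdbc:redshift://redshifthost:5439/database?user=username&password=password'
);

SELECT count(*) FROM redshift_event_logs;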