With my limited understanding of Redshift, this is my plan for approaching the problem:
I want to take the results of a query and use them as the input for an EMR job. What is the best way to do this programmatically?
Currently my EMR job takes a flat file from S3 as its input, and I use the Amazon Java SDK to set the job up.
Should I write the output of my Redshift query to S3, point my EMR job there, and then remove the file once the EMR job has completed?
Or do Redshift and the AWS SDK offer a more direct way to pipe the query results from Redshift to EMR, cutting out the S3 step?
Thanks
I recently spoke with members of the Amazon Redshift team; they said a solution for this is in the works.
The most common way is to upload the data to Amazon S3 and use the built-in features of Amazon EMR to load the data onto your cluster. You can also use the DistributedCache feature of Hadoop to transfer files from a distributed file system to the local file system.
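As a rough sketch of the S3 route (the bucket names, class name, and helper below are illustrative, not part of any AWS API): the EMR step's input argument is simply the s3:// prefix the data was uploaded to. In real code the argument list would be passed to `HadoopJarStepConfig.withArgs(...)` from the AWS Java SDK; here it is assembled with the standard library only so the example stays self-contained:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class S3InputStep {
    // Validate that a path is an s3:// (or legacy s3n://) URI and return it;
    // the Hadoop job on the EMR side resolves such paths through the S3 filesystem.
    static String s3Path(String path) {
        URI uri = URI.create(path);
        if (!"s3".equals(uri.getScheme()) && !"s3n".equals(uri.getScheme())) {
            throw new IllegalArgumentException("not an S3 path: " + path);
        }
        return path;
    }

    // Argument list for a custom-jar step: input prefix, then output prefix.
    // These are the strings you would hand to HadoopJarStepConfig.withArgs(...).
    static List<String> stepArgs(String input, String output) {
        return Arrays.asList(s3Path(input), s3Path(output));
    }

    public static void main(String[] args) {
        // Hypothetical bucket and key names for illustration.
        System.out.println(stepArgs("s3://my-bucket/input/flatfile.csv",
                                    "s3://my-bucket/output/"));
    }
}
```

The same S3 paths can later be deleted with the SDK's S3 client once the job finishes, which matches the clean-up step described in the question.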
To use the query editor on the Amazon Redshift console: on the navigation menu, choose Query editor, then connect to a database in your cluster. For Schema, choose public to create a new table based on that schema. Enter your CREATE TABLE statement in the query editor window and choose Run to create the table.
This is pretty easy - no need for Sqoop. Add a Cascading Lingual step at the front of your job that executes a Redshift UNLOAD command to S3:
UNLOAD ('select_statement')
TO 's3://object_path_prefix'
[ WITH ] CREDENTIALS [AS] 'aws_access_credentials'
[ option [ ... ] ]
Then you can either process the export directly on S3, or add an S3DistCp step to bring the data onto HDFS first.
This will be a lot more performant than adding Sqoop, and a lot simpler to maintain.
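As a minimal sketch of driving that UNLOAD programmatically (the class and method names below are hypothetical, and the credential values are placeholders): the statement can be built as a string and then executed over JDBC with the standard Redshift/PostgreSQL driver before the EMR job runs. Only the string construction is shown so the example stays self-contained:

```java
public class RedshiftUnload {
    /**
     * Build an UNLOAD statement that exports a query's result set to S3,
     * using the older CREDENTIALS clause from the template above.
     */
    static String buildUnload(String select, String s3Prefix, String credentials) {
        // Per the Redshift docs, single quotes inside the SELECT text
        // must be escaped as \' within the UNLOAD string literal.
        String escaped = select.replace("'", "\\'");
        return "UNLOAD ('" + escaped + "') "
             + "TO '" + s3Prefix + "' "
             + "CREDENTIALS '" + credentials + "'";
    }

    public static void main(String[] args) {
        // Placeholder bucket, query, and credential values for illustration.
        String sql = buildUnload(
                "select * from events where day = '2014-01-01'",
                "s3://my-bucket/unload/part-",
                "aws_access_key_id=...;aws_secret_access_key=...");
        System.out.println(sql);
        // Execute `sql` via java.sql.Connection/Statement, then either point
        // the EMR job at s3://my-bucket/unload/ or run S3DistCp to copy it to HDFS.
    }
}
```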