I have a zipped file in S3 that I would like to load into a Redshift database. The only approach my research has turned up is to launch an EC2 instance, move the file there, unzip it, send it back to S3, and then load it into my Redshift table. But I am trying to do all of this with the Java SDK from an outside machine and do not want to have to use an EC2 instance. Is there a way to have an EMR job unzip the file, or to load the zipped file directly into Redshift?
The files are .zip, not gzip.
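For what it's worth, here is a rough sketch of the "outside machine" route with the AWS SDK for Java: stream the .zip out of S3, re-compress its contents as gzip (which COPY does understand), and upload the result back to S3. The bucket, keys, and the assumption that the archive holds a single CSV are all placeholders; running the COPY itself is shown in the JDBC example further down.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipToGzip {
    public static void main(String[] args) throws Exception {
        // Placeholder bucket and keys -- substitute your own.
        String bucket = "my-bucket";
        String zipKey = "incoming/data.zip";
        String gzKey  = "staging/data.csv.gz";

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Stream the .zip out of S3 and re-compress its first entry as gzip;
        // COPY understands gzip/lzop/bzip2 but not zip.
        File gzFile = File.createTempFile("data", ".csv.gz");
        try (S3Object zipObject = s3.getObject(bucket, zipKey);
             ZipInputStream zin = new ZipInputStream(zipObject.getObjectContent());
             OutputStream gzOut = new GZIPOutputStream(new FileOutputStream(gzFile))) {
            ZipEntry entry = zin.getNextEntry();   // assumes a single CSV inside the archive
            if (entry == null) {
                throw new IllegalStateException("zip archive is empty: " + zipKey);
            }
            byte[] buf = new byte[8192];
            int n;
            while ((n = zin.read(buf)) > 0) {
                gzOut.write(buf, 0, n);
            }
        }

        // Upload the gzipped copy back to S3; a COPY ... GZIP can then load it.
        s3.putObject(bucket, gzKey, gzFile);
    }
}
```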
The simplest way to insert a row in Redshift is to use the INSERT INTO command and specify values for all columns. If the table has 10 columns, you have to supply 10 values, in the order in which the table was defined.
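A minimal JDBC sketch of that, assuming a hypothetical three-column table created as CREATE TABLE users (id INT, name VARCHAR(50), created DATE), plus a placeholder cluster endpoint and credentials:

```java
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class InsertExample {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials; the Redshift JDBC driver must be on the classpath.
        String url = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
        try (Connection conn = DriverManager.getConnection(url, "dbuser", "dbpassword");
             PreparedStatement ps = conn.prepareStatement(
                     // No column list, so the three values must follow the table definition order.
                     "INSERT INTO users VALUES (?, ?, ?)")) {
            ps.setInt(1, 42);
            ps.setString(2, "Alice");
            ps.setDate(3, Date.valueOf("2024-01-01"));
            ps.executeUpdate();
        }
    }
}
```

For bulk loads like the one in the question, though, COPY from S3 is the recommended path rather than row-by-row INSERTs.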
For optimum COPY parallelism, the ideal file size is 1–125 MB after compression.
You cannot load a zipped file directly into Redshift, as per Guy's comment.
Assuming this is not a one-time task, I would suggest using AWS Data Pipeline to perform this work. See this example of copying data between S3 buckets. Modify the example to unzip and then gzip your data instead of simply copying it.
Use the ShellCommandActivity
to execute a shell script that performs the work. I would assume this script could invoke Java if you choose an appropriate AMI as your EC2 resource (YMMV).
Data Pipeline is well suited to this type of work because it starts and terminates the EC2 resource automatically, and you do not have to worry about discovering the name of the new instance in your scripts.
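As a very rough illustration of that setup (not a drop-in definition), the sketch below uses the AWS SDK for Java's Data Pipeline client to register an on-demand pipeline whose ShellCommandActivity downloads the zip, re-compresses it as gzip, and pushes it back to S3 on a transient EC2 resource. All names, buckets, roles, and the shell command itself are placeholders, and a real definition will need whatever IAM roles and scheduling your account requires.

```java
import com.amazonaws.services.datapipeline.DataPipeline;
import com.amazonaws.services.datapipeline.DataPipelineClientBuilder;
import com.amazonaws.services.datapipeline.model.ActivatePipelineRequest;
import com.amazonaws.services.datapipeline.model.CreatePipelineRequest;
import com.amazonaws.services.datapipeline.model.Field;
import com.amazonaws.services.datapipeline.model.PipelineObject;
import com.amazonaws.services.datapipeline.model.PutPipelineDefinitionRequest;

public class UnzipPipeline {
    public static void main(String[] args) {
        DataPipeline dp = DataPipelineClientBuilder.defaultClient();

        // Register an empty pipeline and get its id.
        String pipelineId = dp.createPipeline(new CreatePipelineRequest()
                .withName("unzip-and-gzip")
                .withUniqueId("unzip-and-gzip-1")).getPipelineId();

        // Default object: run on demand, log to S3 (placeholder bucket and roles).
        PipelineObject defaults = new PipelineObject()
                .withId("Default").withName("Default")
                .withFields(
                        new Field().withKey("scheduleType").withStringValue("ondemand"),
                        new Field().withKey("pipelineLogUri").withStringValue("s3://my-bucket/logs/"),
                        new Field().withKey("role").withStringValue("DataPipelineDefaultRole"),
                        new Field().withKey("resourceRole").withStringValue("DataPipelineDefaultResourceRole"));

        // Transient EC2 instance that Data Pipeline starts and terminates for us.
        PipelineObject ec2 = new PipelineObject()
                .withId("Ec2Instance").withName("Ec2Instance")
                .withFields(
                        new Field().withKey("type").withStringValue("Ec2Resource"),
                        new Field().withKey("terminateAfter").withStringValue("30 Minutes"));

        // ShellCommandActivity: pull the zip, re-compress as gzip, push it back.
        String command = "aws s3 cp s3://my-bucket/incoming/data.zip . && "
                       + "unzip -p data.zip > data.csv && "
                       + "gzip data.csv && "
                       + "aws s3 cp data.csv.gz s3://my-bucket/staging/data.csv.gz";
        PipelineObject unzipActivity = new PipelineObject()
                .withId("UnzipActivity").withName("UnzipActivity")
                .withFields(
                        new Field().withKey("type").withStringValue("ShellCommandActivity"),
                        new Field().withKey("command").withStringValue(command),
                        new Field().withKey("runsOn").withRefValue("Ec2Instance"));

        // Push the definition and activate the pipeline.
        dp.putPipelineDefinition(new PutPipelineDefinitionRequest()
                .withPipelineId(pipelineId)
                .withPipelineObjects(defaults, ec2, unzipActivity));
        dp.activatePipeline(new ActivatePipelineRequest().withPipelineId(pipelineId));
    }
}
```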
Add the GZIP option to your COPY command; please refer to: http://docs.aws.amazon.com/redshift/latest/dg/c_loading-encrypted-files.html
We can use the Java client (JDBC) to execute the SQL.
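For example (a sketch with a placeholder table, bucket, cluster endpoint, and IAM role), the gzipped file staged in S3 can be loaded by running COPY ... GZIP through the Redshift JDBC driver:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyGzipExample {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint, credentials, table, and IAM role.
        String url = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
        String copySql =
                "COPY my_table "
              + "FROM 's3://my-bucket/staging/data.csv.gz' "
              + "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole' "
              + "CSV GZIP";
        try (Connection conn = DriverManager.getConnection(url, "dbuser", "dbpassword");
             Statement stmt = conn.createStatement()) {
            stmt.execute(copySql);   // COPY decompresses the gzip on the Redshift side
        }
    }
}
```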