How to pipe data from AWS Postgres RDS to S3 (then Redshift)?

I'm using the AWS Data Pipeline service to pipe data from an RDS MySQL database to S3 and then on to Redshift, which works nicely.

However, I also have data living in an RDS Postgres instance which I would like to pipe the same way, but I'm having a hard time setting up the JDBC connection. If this is unsupported, is there a work-around?

"connectionString": "jdbc:postgresql://THE_RDS_INSTANCE:5432/THE_DB”
asked Nov 06 '14 by jenswirf


2 Answers

Nowadays you can define a copy-activity to extract data from a Postgres RDS instance into S3. In the Data Pipeline interface:

  1. Create a data node of the type SqlDataNode. Specify the table name and a select query.
  2. Set up the database connection by specifying the RDS instance ID (the instance ID is the first part of your endpoint URL, e.g. your-instance-id.xxxxx.eu-west-1.rds.amazonaws.com) along with the username, password and database name.
  3. Create a data node of the type S3DataNode.
  4. Create a Copy activity and set the SqlDataNode as input and the S3DataNode as output.
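Wired together, a pipeline definition along those lines might look roughly like this (a sketch only; all IDs, the instance ID, bucket path and credentials are placeholders, and the exact field names should be checked against the Data Pipeline object reference):

```json
{
  "objects": [
    { "id": "rds_pg", "type": "RdsDatabase",
      "rdsInstanceId": "your-instance-id",
      "databaseName": "THE_DB",
      "username": "user",
      "*password": "secret" },

    { "id": "source", "type": "SqlDataNode",
      "database": { "ref": "rds_pg" },
      "table": "blahs",
      "selectQuery": "select * from #{table}" },

    { "id": "dest", "type": "S3DataNode",
      "directoryPath": "s3://your-bucket/export/" },

    { "id": "ec2", "type": "Ec2Resource",
      "instanceType": "t1.micro",
      "terminateAfter": "1 Hour" },

    { "id": "copy", "type": "CopyActivity",
      "runsOn": { "ref": "ec2" },
      "input":  { "ref": "source" },
      "output": { "ref": "dest" } }
  ]
}
```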
answered Sep 23 '22 by PeterssonJesper

This doesn't work yet. AWS hasn't built/released the functionality to connect nicely to Postgres. You can do it in a ShellCommandActivity though. You can write a little Ruby or Python code to do it, drop that in a script on S3, and point to it with scriptUri. Or you could just write a psql command to dump the table to a CSV and then pipe that to ${OUTPUT1_STAGING_DIR}, with "stage": "true" set in that activity node.

Something like this:

{
  "id": "DumpCommand",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "MyEC2Resource" },
  "stage": "true",
  "output": { "ref": "S3ForRedshiftDataNode" },
  "command": "PGPASSWORD=password psql -h HOST -U USER -d DATABASE -p 5432 -t -A -F\",\" -c \"select blah_id from blahs\" > ${OUTPUT1_STAGING_DIR}/my_data.csv"
}

I didn't run this to verify because it's a pain to spin up a pipeline :( so double-check the escaping in the command.

  • pros: super straightforward and requires no additional script files to upload to S3
  • cons: not exactly secure; your DB password will be transmitted over the wire unencrypted
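If you'd rather go the script-on-S3 route mentioned above, a minimal Python sketch could look like the following. This is an assumption-laden illustration: it assumes psql and boto3 are available on the EC2 box, and the bucket, table and connection details are all placeholders.

```python
import os
import subprocess


def build_psql_dump_cmd(host, user, db, table, out_path, port=5432):
    """Build the argv for a psql CSV dump: tuples only, unaligned, comma-separated."""
    return [
        "psql", "-h", host, "-U", user, "-d", db, "-p", str(port),
        "-t",        # tuples only (no headers/footers)
        "-A",        # unaligned output
        "-F", ",",   # comma as field separator
        "-c", f"select * from {table}",
        "-o", out_path,
    ]


def dump_and_upload(host, user, password, db, table, bucket, key):
    """Dump a table to a local CSV via psql, then upload it to S3."""
    import boto3  # assumed installed on the pipeline's EC2 resource

    out_path = f"/tmp/{table}.csv"
    cmd = build_psql_dump_cmd(host, user, db, table, out_path)
    # Pass the password via PGPASSWORD while keeping the rest of the environment.
    subprocess.run(cmd, check=True, env={**os.environ, "PGPASSWORD": password})
    boto3.client("s3").upload_file(out_path, bucket, key)
```

Same security caveat as the inline command: the password still ends up in the process environment, so this is a convenience sketch, not a hardened solution.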

Look into the parameterized templating for Data Pipeline that AWS just launched: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-custom-templates.html. It looks like it will allow encryption of arbitrary parameters.
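With that, the password could be pulled out of the command string into a parameter. A rough sketch (the parameter name *myRdsPassword is illustrative, and the exact secure-parameter syntax should be checked against the linked docs):

```json
{
  "parameters": [
    { "id": "*myRdsPassword", "type": "String",
      "description": "Postgres password; the leading * keeps the value hidden" }
  ],
  "objects": [
    { "id": "DumpCommand",
      "type": "ShellCommandActivity",
      "runsOn": { "ref": "MyEC2Resource" },
      "stage": "true",
      "output": { "ref": "S3ForRedshiftDataNode" },
      "command": "PGPASSWORD=#{*myRdsPassword} psql -h HOST -U USER -d DATABASE -p 5432 -t -A -F\",\" -c \"select blah_id from blahs\" > ${OUTPUT1_STAGING_DIR}/my_data.csv" }
  ]
}
```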

answered Sep 24 '22 by xgess