Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to load data from S3 to PostgreSQL RDS

I have a need to load data from S3 to Postgres RDS (around 50-100 GB) I don't have the option to use AWS Data Pipeline and I am looking for something similar to using the COPY command to load data in S3 into Amazon Redshift.

I would appreciate any suggestions on how I can accomplish this.

like image 990
Raj Avatar asked Oct 29 '22 22:10

Raj


1 Answers

Originally, this answer was trying to use the S3 to Postgres RDS Functionality. That whole enterprise failed (see below).

The way I have finally been able to do this is:

  1. Set-up an EC2 instance with psql installed (see below near end of post)
  2. Copy the relevant CSVs to import from S3 to the local instance
  3. Use the psql /copy command to import the files up

This last part is really, really important. If you use the SQL COPY command the entire RDS Postgres role structure will frustrate you to no end. It has a wonky SUPERRDSADMIN role which is not very super at all. However, if you use the psql /copy commany you apparently can do anything. I have confirmed this be the case and have started my uploads succesfully. I will come back and re-edit this post (time permitting) to add relevant documentation steps for the above.

Caveat Emptor: The post below was all the original work I had done trying to get this implemented. I don't want to bury the lead despite multiple efforts (including what can only be described as pathetic tech support from AWS) I don't believe that this feature is ready for prime time. Despite a very simple test environment, easy to replicate, AWS has not provided an effective way to not get the copy statement to crap out as follows:

The actual call to aws_s3.table_import_from_s3(...) is reporting a permission problem between RDS and S3. From my research work with psql this appears to be a C library, probably installed by AWS.

NOTICE: CURL error code: 28 when attempting to validate pre-signed URL, 1 attempt(s) remaining NOTICE: HINT: make sure your instance is able to connect with S3.

S3 to Postgres RDS Functionality Now Added

On 2019-04-24 AWS released functionality allowing a Postgres RDS to load directly from S3. You can read the announcement here, and see the documentation page here.

I am sharing with the OP because this appears to be the AWS supported way of solving the question posed.

Key summary points:

  • Requires Postgres 11.1 or greater
  • Need access to psql and the ability to connect it to the RDS instance
  • Need to install the aws_s3 extension which pulls in aws_commons.
  • You can get to the S3 bucket by specifying credentials or by assigning IAM roles to RDS
  • It advertises supporting all of the same data formats as the postgres COPY command
  • It currently only appears to support a single file at a time (ie no regex)

The instructions are fairly detailed and provide a variety of paths to configuring (AWS CLI scripts, Console instructions, etc). Additionally, the option to use your IAM keys rather than have to set-up roles is nice.

I did not find a way to download just psql, so I had to bring down a full postgres install down to my mac, but that was no big deal with brew:

brew install postgres

and since the DB service does not get activated it is the quickest way to get psql.

Update: Decided that having psql on my mac was a security hole, port forwarding, etc. I found that there is a simple Postgres install available for AMI Linux 2 under the AMI Extras rubric. The install command is fairly simple on your ami instance type.

sudo amazon-linux-extras install postgresql10

psql is fairly easy to use, however, important to keep in mind that any instructions to psql itself are escaped by a \. Documentation on psql can be found here. Recommend going through it at least once before executing the AWS recommended scripts.

To the extent you run tight security and have access to your RDS instances seriously restricted (which I do) don't forget to open up the ports from your AMI instance running Postgres to your RDS instance.

If your preference is a GUI then you can try to use PGAdmin4. It is the AWS recommended way of connecting to RDS Postgres instances according to the docs. I was unable to get any of the SSH tunneling features to work (which is why I ended up doing the localhost SSH mapping that I used for psql). I also found it to be rather buggy in other ways. Reading reviews of the product it seems that version 4 may not be the stablest of releases.

like image 126
zenzei Avatar answered Nov 15 '22 06:11

zenzei