Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pipeline from AWS RDS to S3 using Glue

I was trying AWS Glue to migrate our current data pipeline from python scripts to AWS Glue . I was able to setup a crawler to pull the schema for the different postgres databases . However, I am facing issues in pulling data from Postgres RDS to S3 tables in Athena .

  • Is there a way to directly pull data from AWS RDS to S3 tables in Athena ?
  • If Yes , then how
  • If not then any better suggestions are always welcome

Thanks in advance !

like image 541
Eshank Jain Avatar asked Dec 11 '18 03:12

Eshank Jain


1 Answers

You can't pull data from AWS RDS to S3 using Athena. Athena is a query engine over S3 data. To be able to extract data from RDS to S3, you can run a Glue job to read from a particular RDS table and create S3 dump in parquet format which will create another external table pointing to S3 data. Then you can query that S3 data using Athena. A sample code snippet to read from RDS using Glue catalog and write parquet in S3 will look like below. There are some Glue predefined template which you can use to experiment. Start with a small table first. Please let me know if it worked out for you or any further questions/issues.

datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="postgresql", connection_options = 
{"url": "jdbc-url/database",
"user": "user_name",
"password": "password",
"dbtable": "table_name"},
transformation_ctx = "datasource0")

   datasink4 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://aws-glue-tpcds-parquet/"+ tableName + "/"}, format = "parquet", transformation_ctx = "datasink4")
like image 169
Tanveer Uddin Avatar answered Sep 20 '22 00:09

Tanveer Uddin