Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Issues Creating a Glue Connection to an MS SQL Server RDS

I read all the answered question over here, 5 of them. And all of them are out of date about available AWS products right now.

So, as a newbie at AWS, I would like to know how to solve my problem or what is the best approach to solve my problem using only and only AWS solutions. I would like to avoid any third party. I know I'm going to cite one of the approaches I googled but just to refer it.

Anyway, I have a goal to achieve and this is basically replacing my SQL Server 2012 Integrations Services for something using only AWS products. Right now I'm accessing an FTP server and downloading a bunch of CSV files to my drive, reading them, transforming them into my datasets and loading them into my specified tables. This process is scheduled to execute 3 times every single day.

My initial proposal was to upload files to S3, use AWS Glue Crawlers to crawl my files and fill my self-created AWS Glue Data Catalogs, them ETL to my RDS. So far I could achieve my Lambda Function to connect my FTP and upload to my S3, also I could retrieve my data using AWS Athena, just to see if all things were working fine.

But now, I'm struggling to make my ETL copy/create my table into RDS and write the data. I created My Glue Connection under the same RDS VPC, subnet and security group, also my security group has All TCP from anywhere inbound (I know, I'm not leaving this, it is just for tests) and I'm using JDBC, writing the following JDBC URL:

jdbc:sqlserver://my-database-name.xsdfxsdsfsfsx.us-east-1.rds.amazonaws.com:1433;databaseName=my-database-name

I could test my created connection using "Test Connection" inside AWS Glue, and it worked fine. But after creating my Job using the Job tutorial and running it, inside my log errors I can see this:

com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.HttpHostConnectException: Connect to 167.254.77.1:8088 [/167.254.77.1] failed: Connection refused (Connection refused)

I tried to create a connection using Amazon RDS option, but on the second screen after picking the instance I am receiving the following error:

Unable to find a suitable security group. Change connection type to JDBC and retry adding your connection.

I checked my IAM and I do have the AWSGlueServiceRoleDefault role within AWS service: glue trusted service and AWSGlueServiceRole for AWS managed policy as scripted in the documentation.

I would like to know what I'm missing or how to fix it to make it work. Or even if there is any better approach to achieve my goal.

like image 258
Andrew Paes Avatar asked Oct 16 '17 12:10

Andrew Paes


1 Answers

Well, fist of all I had 2 problems to solve. Im going to show how I did solve both of them.

1 - Create a connection using the "Amazon RDS" option;

My JDBC connection was working fine because I was explicit specifying which security group I wanted to be used, I called it "sg-glue", and this "sg-glue" security group was allowed in my inbound database security group, also my "sg-glue" had "All TCP" from anywhere allowed for inbound.

So when I was trying to create my connection using "Amazon RDS" I didn't notice in time this is just an easy way to create the very same JDBC connection, but you don't have the choice to specify which security group you want to use. On this way is applied the very same database security group to this connection, and this was the reason I was seeing that security group error, as my database security group didn't have "All TCP" allowed.

Reading back the documentation I could see what I was doing wrong, in fact I was trying to set a midway security group to protect my database. But as documentaion says I need to have to give "All TCP" permission right into my databse security group, so I did reset it doing the following steps:

  1. In the Amazon RDS left navigation pane, choose Instances.
  2. Choose the Amazon RDS Engine and DB Instance name that you want to access from AWS Glue.
  3. From Instance Actions, choose See Details. On the Details tab, find the Security Groups name you will access from AWS Glue. Record the name of the security group for future reference.
  4. Choose the security group to open the Amazon EC2 console.
  5. Confirm that your Group ID from Amazon RDS is chosen, then choose the Inbound tab.
  6. Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and whose Source is the same security group name as the Group ID. RDS Inbound

  7. Add a rule to for outbound traffic also. Either open outbound traffic to all ports or create a self-referencing rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and whose Source is the same security group name as the Group ID. enter image description here

After all these steps, both connection were working fine.

2 - Use my created connection to allow me to ETL my data to my RDS;

So having my connection working successfully, I tried to make my job connect into my RDS and create my table and write on ti. But I couldn't, I was receiving an error:

Py4JJavaError: An error occurred while calling o74.pyWriteDynamicFrame. java.lang.NullPointerException

And I didn't know why. I tried everything, than I called AWS Support. For my best luck I had such a HERO as support, called Li H. this girl worked a lot and went every single corner to find what was happening, asked every single person at Amazon and after 4 days working, meanwhile we shared my screens, created so many times new environments to find the cause of this problem, since VPC, SG, DNS, and when we were hopeless she had a bulb light over her head and asked to change my database name to "testing"...

So I created a new connection. Created a new Job using this new connection. And it worked.

Conclusion, you can't have the same name for database and instance. They need to be different names.

Public tnx special to this girl, Li H. And very nice to have such a good professional support. Also:

  • username used on the connection need to have "create table" permissions.

  • If your job writes to a Microsoft SQL Server table, and the table has columns defined as type Boolean, then the table must be predefined in the SQL Server database.

  • Your IAM role needs to have the AWS Glue policy attached on it

like image 85
Andrew Paes Avatar answered Nov 20 '22 16:11

Andrew Paes