I read all the answered question over here, 5 of them. And all of them are out of date about available AWS products right now.
So, as a newbie at AWS, I would like to know how to solve my problem or what is the best approach to solve my problem using only and only AWS solutions. I would like to avoid any third party. I know I'm going to cite one of the approaches I googled but just to refer it.
Anyway, I have a goal to achieve and this is basically replacing my SQL Server 2012 Integrations Services for something using only AWS products. Right now I'm accessing an FTP server and downloading a bunch of CSV files to my drive, reading them, transforming them into my datasets and loading them into my specified tables. This process is scheduled to execute 3 times every single day.
My initial proposal was to upload files to S3, use AWS Glue Crawlers to crawl my files and fill my self-created AWS Glue Data Catalogs, them ETL to my RDS. So far I could achieve my Lambda Function to connect my FTP and upload to my S3, also I could retrieve my data using AWS Athena, just to see if all things were working fine.
But now, I'm struggling to make my ETL copy/create my table into RDS and write the data. I created My Glue Connection under the same RDS VPC, subnet and security group, also my security group has All TCP from anywhere inbound (I know, I'm not leaving this, it is just for tests) and I'm using JDBC, writing the following JDBC URL:
jdbc:sqlserver://my-database-name.xsdfxsdsfsfsx.us-east-1.rds.amazonaws.com:1433;databaseName=my-database-name
I could test my created connection using "Test Connection" inside AWS Glue, and it worked fine. But after creating my Job using the Job tutorial and running it, inside my log errors I can see this:
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.HttpHostConnectException: Connect to 167.254.77.1:8088 [/167.254.77.1] failed: Connection refused (Connection refused)
I tried to create a connection using Amazon RDS option, but on the second screen after picking the instance I am receiving the following error:
Unable to find a suitable security group. Change connection type to JDBC and retry adding your connection.
I checked my IAM and I do have the AWSGlueServiceRoleDefault role within AWS service: glue trusted service and AWSGlueServiceRole for AWS managed policy as scripted in the documentation.
I would like to know what I'm missing or how to fix it to make it work. Or even if there is any better approach to achieve my goal.
Well, fist of all I had 2 problems to solve. Im going to show how I did solve both of them.
My JDBC connection was working fine because I was explicit specifying which security group I wanted to be used, I called it "sg-glue", and this "sg-glue" security group was allowed in my inbound database security group, also my "sg-glue" had "All TCP" from anywhere allowed for inbound.
So when I was trying to create my connection using "Amazon RDS" I didn't notice in time this is just an easy way to create the very same JDBC connection, but you don't have the choice to specify which security group you want to use. On this way is applied the very same database security group to this connection, and this was the reason I was seeing that security group error, as my database security group didn't have "All TCP" allowed.
Reading back the documentation I could see what I was doing wrong, in fact I was trying to set a midway security group to protect my database. But as documentaion says I need to have to give "All TCP" permission right into my databse security group, so I did reset it doing the following steps:
Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and whose Source is the same security group name as the Group ID.
Add a rule to for outbound traffic also. Either open outbound traffic to all ports or create a self-referencing rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and whose Source is the same security group name as the Group ID.
After all these steps, both connection were working fine.
So having my connection working successfully, I tried to make my job connect into my RDS and create my table and write on ti. But I couldn't, I was receiving an error:
Py4JJavaError: An error occurred while calling o74.pyWriteDynamicFrame. java.lang.NullPointerException
And I didn't know why. I tried everything, than I called AWS Support. For my best luck I had such a HERO as support, called Li H. this girl worked a lot and went every single corner to find what was happening, asked every single person at Amazon and after 4 days working, meanwhile we shared my screens, created so many times new environments to find the cause of this problem, since VPC, SG, DNS, and when we were hopeless she had a bulb light over her head and asked to change my database name to "testing"...
So I created a new connection. Created a new Job using this new connection. And it worked.
Conclusion, you can't have the same name for database and instance. They need to be different names.
Public tnx special to this girl, Li H. And very nice to have such a good professional support. Also:
username used on the connection need to have "create table" permissions.
If your job writes to a Microsoft SQL Server table, and the table has columns defined as type Boolean, then the table must be predefined in the SQL Server database.
Your IAM role needs to have the AWS Glue policy attached on it
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With