
How to connect AWS Glue to a VPC, and access private resources?

I am trying to connect to services and databases running inside a VPC (private subnets) from an AWS Glue job. The private resources should not be exposed publicly (e.g., moving to a public subnet or setting up public load balancers).

Unfortunately, AWS Glue doesn't seem to support running inside user-defined VPCs. AWS does provide something called Glue Database Connections which, when used with the Glue SDK, set up elastic network interfaces inside the specified VPC for the Glue/Spark worker nodes. The network interfaces then tunnel traffic from Glue to a specific database inside the VPC. However, this requires the location and credentials of specific databases, and it is not clear if and when other traffic (e.g., a REST call to a service) is tunnelled through the VPC.

Is there a reliable way to set up a Glue -> VPC connection that will tunnel all traffic through a VPC?

Turiphro asked May 01 '20 10:05





2 Answers

You can create a Glue connection with the NETWORK connection type and attach that connection to your Glue job. The job can then call a REST API or reach any other resource within your VPC.


https://docs.aws.amazon.com/glue/latest/dg/connection-using.html

Network (designates a connection to a data source within an Amazon Virtual Private Cloud environment (Amazon VPC))
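As a rough sketch of the same setup done through the API instead of the console, the snippet below builds the payload for a NETWORK connection and passes it to boto3's `create_connection`. The connection name, subnet ID, security group ID, and Availability Zone are placeholders, not values from the question, and the actual AWS call is left commented out because it needs credentials.

```python
from typing import Any, Dict, List


def build_network_connection_input(
    name: str,
    subnet_id: str,
    security_group_ids: List[str],
    availability_zone: str,
) -> Dict[str, Any]:
    """Build the ConnectionInput payload for a Glue connection of type NETWORK."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # A NETWORK connection carries no JDBC URL or credentials.
        "ConnectionProperties": {},
        # These settings tell Glue where to create its network interfaces.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_network_connection(connection_input: Dict[str, Any]) -> None:
    """Register the connection with Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("glue").create_connection(ConnectionInput=connection_input)


# Placeholder identifiers; replace them with your own private subnet and group.
connection_input = build_network_connection_input(
    name="private-vpc-network",
    subnet_id="subnet-0123456789abcdef0",
    security_group_ids=["sg-0123456789abcdef0"],
    availability_zone="eu-west-1a",
)
# create_network_connection(connection_input)  # uncomment to call AWS
```

The connection only takes effect once it is attached to the job, for example by listing its name under the job's Connections setting when the job is created or updated.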


https://docs.aws.amazon.com/glue/latest/dg/connection-JDBC-VPC.html

To allow AWS Glue to communicate with its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same security group in the VPC and not open it to all networks.
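The self-referencing rule described above could be added with boto3 roughly as follows. This is a sketch, not the only way to do it: the security group ID is a placeholder, and the call that talks to AWS is commented out because it needs credentials.

```python
from typing import Any, Dict


def self_referencing_tcp_rule(security_group_id: str) -> Dict[str, Any]:
    """Inbound rule allowing all TCP ports, but only from the group itself."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        # The source is the same security group, not a CIDR range.
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }


def authorize_self_reference(security_group_id: str) -> None:
    """Attach the rule to the group (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rule builder works without boto3

    boto3.client("ec2").authorize_security_group_ingress(
        GroupId=security_group_id,
        IpPermissions=[self_referencing_tcp_rule(security_group_id)],
    )


# Placeholder security group ID.
rule = self_referencing_tcp_rule("sg-0123456789abcdef0")
# authorize_self_reference("sg-0123456789abcdef0")  # uncomment to call AWS
```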


Alexandr Lihonosov answered Oct 16 '22 13:10



However, this requires the location and credentials of specific databases, and it is not clear if and when other traffic (e.g., a REST call to a service) is tunnelled through the VPC.

I agree the documentation is confusing, but according to this paragraph on the page you linked, it appears that all traffic is indeed tunneled through the VPC. Once Glue is configured with VPC access, you need a NAT Gateway or VPC endpoints for it to reach anything outside the VPC:

All JDBC data stores that are accessed by the job must be available from the VPC subnet. To access Amazon S3 from within your VPC, a VPC endpoint is required. If your job needs to access both VPC resources and the public internet, the VPC needs to have a Network Address Translation (NAT) gateway inside the VPC.
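One way to check this empirically is to run a small TCP probe from inside the Glue job and see which endpoints are reachable. The helper below is a minimal sketch using only the standard library; the host names in the commented usage are hypothetical and stand in for your own private service and a public endpoint.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage inside a Glue job script (hypothetical host names):
# print("private service:", can_reach("my-service.internal.example", 443))
# print("public internet:", can_reach("example.com", 443))
```

If the private service is reachable but the public endpoint is not, that suggests the job's traffic is going through the VPC and the subnet has no NAT gateway.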

Mark B answered Oct 16 '22 13:10
