
AWS Glue ETL job from AWS Redshift to S3 fails

I am trying out the AWS Glue service to ETL some data from Redshift to S3. The crawler runs successfully and creates the metadata table in the Data Catalog. However, when I run the ETL job (generated by AWS), it fails after around 20 minutes with "Resource unavailable".

I cannot see any AWS Glue logs or error logs in CloudWatch. When I try to view them, I get "Log stream not found. The log stream jr_xxxxxxxxxx could not be found. Check if it was correctly created and retry."

I would appreciate it if you could provide any guidance to resolve this issue.

asked Aug 22 '17 08:08 by user_default



2 Answers

Basically, a job you add to Glue will only run if there is not too much traffic in the region where your Glue job runs. If no resources are available, you need to either re-run the job manually or subscribe to CloudWatch events via SNS so that you are notified.

There are also parameters you can set on the job, such as the maximum number of retries (`MaxRetries`) and the timeout (`Timeout`).

If you get "Resource unavailable", it won't trigger a retry, because the job did not fail; it never even started. But if you set the timeout to, say, 60 minutes, the run will error out after that time, decrement your retry count, and re-launch the job.
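As a rough illustration of the timeout part, here is a minimal sketch using boto3, assuming you start runs yourself rather than through a trigger. The helper names and the job name are hypothetical. `Timeout` on `start_job_run` is given in minutes; the retry count is set on the job definition itself as `MaxRetries` (via `create_job` or `update_job`), not per run.

```python
def build_job_run_args(job_name, timeout_minutes=60):
    """Build keyword arguments for Glue's start_job_run call.

    Timeout is expressed in minutes. The 60-minute default mirrors the
    example above and is only an illustration.
    """
    if timeout_minutes < 1:
        raise ValueError("timeout_minutes must be at least 1")
    return {"JobName": job_name, "Timeout": timeout_minutes}


def start_with_timeout(glue_client, job_name, timeout_minutes=60):
    """Start a job run with an explicit timeout and return its JobRunId."""
    response = glue_client.start_job_run(
        **build_job_run_args(job_name, timeout_minutes)
    )
    return response["JobRunId"]


# Usage (needs AWS credentials and a region; the job name is hypothetical):
#   import boto3
#   glue = boto3.client("glue")
#   run_id = start_with_timeout(glue, "redshift-to-s3", timeout_minutes=60)
```

The client is passed in as an argument so the helper can be exercised with a stub before pointing it at a real account.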

answered Nov 13 '22 05:11 by maxeber


The closest thing I see to Glue documentation on this is here:

If you encounter errors in AWS Glue, use the following solutions to help you find the source of the problems and fix them.

Note: The AWS Glue GitHub repository contains additional troubleshooting guidance in AWS Glue Frequently Asked Questions.

Error: Resource Unavailable

If AWS Glue returns a resource unavailable message, you can view error messages or logs to help you learn more about the issue. The following tasks describe general methods for troubleshooting.

• A custom DNS configuration without reverse lookup can cause AWS Glue to fail. Check your DNS configuration. If you are using Amazon Route 53 or Microsoft Active Directory, make sure that there are forward and reverse lookups. For more information, see Setting Up DNS in Your VPC (p. 23).
• For any connections and development endpoints that you use, check that your cluster has not run out of elastic network interfaces.
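One way to approach the network interface check is to look at how many free IP addresses remain in the subnet your Glue connection uses, since each elastic network interface needs a private IP from that subnet. This is only a proxy and my own assumption, not something the documentation above prescribes. The helper name and the `min_free` threshold are hypothetical; `describe_subnets` and its `AvailableIpAddressCount` field come from the EC2 API.

```python
def subnets_low_on_ips(ec2_client, subnet_ids, min_free=10):
    """Return {subnet_id: free_ip_count} for subnets with fewer than min_free IPs.

    min_free is an arbitrary illustrative threshold, not a Glue requirement.
    """
    response = ec2_client.describe_subnets(SubnetIds=list(subnet_ids))
    return {
        subnet["SubnetId"]: subnet["AvailableIpAddressCount"]
        for subnet in response["Subnets"]
        if subnet["AvailableIpAddressCount"] < min_free
    }


# Usage (needs AWS credentials; the subnet ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2")
#   print(subnets_low_on_ips(ec2, ["subnet-xxxxxxxx"]))
```

An empty result means every listed subnet is above the threshold; it does not rule out the DNS issue from the first bullet.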

answered Nov 13 '22 05:11 by Miguel