AWS Glue jobs not writing to S3

I have just been playing around with Glue but have yet to get it to successfully create a new table in an existing S3 bucket. The job will execute without error but there is never any output in S3.

Here's the auto-generated code:

glueContext.write_dynamic_frame.from_options(
    frame = applymapping1,
    connection_type = "s3",
    connection_options = {"path": "s3://glueoutput/output/"},
    format = "json",
    transformation_ctx = "datasink2")

I've tried all variations of this: with the name of a file (that doesn't exist yet), in the root folder of the bucket, with a trailing slash and without. The role being used has full access to S3, and I've also tried creating buckets in different regions. No file is ever created, yet the console says the job succeeded.

asked Sep 21 '17 by billobo



2 Answers

As @Drellgor suggests in his comment to the previous answer, make sure you have disabled "Job Bookmarks" unless you definitely don't want to reprocess old files.

From the documentation:

"AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data."

answered Sep 24 '22 by Sinan Erdem


Your code is correct. Just verify whether there is any data at all in the applymapping1 DynamicFrame; you can check with this command: applymapping1.toDF().show()
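For example, a quick sanity check you could add just before the write (count, printSchema and toDF are standard DynamicFrame methods):

# If this prints 0, nothing will be written to S3 (for example because
# job bookmarks already marked the source data as processed).
print("record count:", applymapping1.count())
applymapping1.printSchema()          # inspect the mapped schema
applymapping1.toDF().show(20)        # preview the first 20 rows as a Spark DataFrame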

answered Sep 26 '22 by letstry