
Why do I need to set the `transformation_ctx` parameter when calling transformation and sink operations for AWS Glue bookmark to work?

The AWS Glue job bookmark documentation (https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) seems to suggest that a transformation_ctx parameter has to be passed to source, transform and sink operations for the bookmark to work. This is reflected in the sample code on that page, where create_dynamic_frame.from_catalog(), ApplyMapping.apply() and write_dynamic_frame.from_options() are all called with a transformation_ctx value.

I can understand why transformation_ctx is passed to the create_dynamic_frame.from_catalog() method: AWS Glue needs to store information about the files that have already been read, and it stores this in the bookmark under the given transformation_ctx key.

However, I don't understand why this is also necessary for methods like ApplyMapping.apply() and write_dynamic_frame.from_options(). To put it another way, what is the state information these operations need to store in the bookmark? If I don't pass transformation_ctx to these methods, what problems will this cause?

victorx asked Jun 24 '20 05:06



1 Answer

I had the same doubts about bookmarking some months ago (October 2019). Since the documentation provided by Amazon is not very clear, I opened a support case to better understand how it is implemented.

My Glue job consisted of:

  • A read function from S3 (glue_context.create_dynamic_frame.from_options)
  • A ResolveChoice.apply
  • A write function to Redshift (glue_context.write_dynamic_frame.from_jdbc_conf)

All of these operations had a transformation_ctx value. I tested several possible variations (the same transformation_ctx for all of them, different ones, fixed values, dynamic values, etc.).

After many messages with AWS support, they confirmed that bookmarking works only on the read function. (They also said it works only with S3 as a source, but I didn't test that.) So I asked whether the transformation_ctx is useless in ResolveChoice (and in the write function too), and they said YES! They confirmed that it doesn't make any difference.

Furthermore, for the write function it doesn't change anything: there is no bookmark logic, and no "avoid function" that skips the write if it has already been run before.
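
To make the reported behaviour concrete, here is a minimal sketch of a toy in-memory model. It is not the Glue API and not how Glue is implemented: the ToyBookmark class and its read/transform/write methods are hypothetical names invented for illustration, and the model simply assumes what support described above (only the read keeps state under its transformation_ctx; the transform and the write ignore it). It also leaves out details such as committing the bookmark at the end of a job.

```python
class ToyBookmark:
    """Hypothetical stand-in for a job bookmark; NOT the AWS Glue API."""

    def __init__(self):
        # transformation_ctx -> set of file names already read
        self.state = {}

    def read(self, files, transformation_ctx=""):
        # Without a transformation_ctx there is no key to store state under,
        # so in this model every file is returned on every run.
        if not transformation_ctx:
            return list(files)
        seen = self.state.setdefault(transformation_ctx, set())
        new_files = [f for f in files if f not in seen]
        seen.update(new_files)
        return new_files

    def transform(self, records, transformation_ctx=""):
        # Assumption from the answer: the context is accepted but ignored.
        return [r.upper() for r in records]

    def write(self, records, sink, transformation_ctx=""):
        # Assumption from the answer: no bookmark logic, the write always runs.
        sink.extend(records)


bookmark = ToyBookmark()
sink = []

# First run: both files are new, so both are read and written.
run1 = bookmark.read(["a.csv", "b.csv"], transformation_ctx="source0")
bookmark.write(bookmark.transform(run1), sink)

# Second run: only the file not seen under "source0" is read.
# The transform and write are called without any transformation_ctx.
run2 = bookmark.read(["a.csv", "b.csv", "c.csv"], transformation_ctx="source0")
bookmark.write(bookmark.transform(run2), sink)
```

In this model the only key that ever appears in the stored state is the one given to the read, which matches what support described: leaving transformation_ctx off the transform and the write changes nothing.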

Hyruma92 answered Sep 30 '22 13:09