The AWS Glue job bookmark documentation (https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) seems to suggest that one has to pass a transformation_ctx parameter to every source, transform and sink operation for the bookmark to work. This is reflected in the sample code on that page, where the calls to create_dynamic_frame.from_catalog(), ApplyMapping.apply() and write_dynamic_frame.from_options() are all passed a transformation_ctx value.
I can understand why a transformation_ctx is passed to the create_dynamic_frame.from_catalog() method: AWS Glue needs to store information about the files that have already been read, and it keeps that in the bookmark under the given transformation_ctx key.
However, I don't understand why this is also necessary for methods like ApplyMapping.apply() and write_dynamic_frame.from_options(). To put it another way, what state information do these operations need to store in the bookmark? If I don't pass transformation_ctx to these methods, what problems will that cause?
The transformation_ctx parameter is used to identify state information within a job bookmark for the given operator. Specifically, AWS Glue uses transformation_ctx to index the key to the bookmark state. For job bookmarks to work properly, enable the job bookmark parameter and set the transformation_ctx parameter.
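To make the idea of "indexing the key to the bookmark state" concrete, here is a toy model in plain Python. It is only a sketch: the ToyBookmarkStore class, its method names and the file names are all made up for illustration, and it does not reflect how AWS Glue actually stores or updates bookmark state internally.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBookmarkStore:
    """Toy stand-in for job bookmark state; not how Glue stores it internally."""

    # Maps a transformation_ctx string to the set of inputs already processed.
    state: dict = field(default_factory=dict)

    def unread(self, transformation_ctx, available_files):
        """Return the files not yet recorded under this transformation_ctx."""
        if not transformation_ctx:
            # No key means no state to look up, so everything is read again.
            return sorted(available_files)
        seen = self.state.get(transformation_ctx, set())
        return sorted(set(available_files) - seen)

    def commit(self, transformation_ctx, processed_files):
        """Record processed files under the key (a no-op without a key)."""
        if transformation_ctx:
            self.state.setdefault(transformation_ctx, set()).update(processed_files)


store = ToyBookmarkStore()

# First run: two files exist and both are new for the key "datasource0".
run1 = store.unread("datasource0", ["a.json", "b.json"])
store.commit("datasource0", run1)

# Second run: only the file added since the last commit is returned.
run2 = store.unread("datasource0", ["a.json", "b.json", "c.json"])

# A read without a transformation_ctx has no state and sees every file.
run2_no_ctx = store.unread("", ["a.json", "b.json", "c.json"])
```

In this model only an operation that consumes external input has anything worth recording under its key, which is the intuition behind the question above.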
AWS Glue generates the required Python or Scala code, which you can customize as per your data transformation needs. In the Advanced properties section, choose Enable in the Job bookmark list to avoid reprocessing old data.
AWS Glue provides a serverless environment to extract, transform, and load a large number of datasets from several sources for analytics purposes. It has a feature called job bookmarks to process incremental data when rerunning a job on a scheduled interval.
I had the same doubts about bookmarking some months ago (October 2019), and since the documentation provided by Amazon is not very clear, I opened a support case to better understand how it is implemented.
In my Glue Job there was:
All of these operations had a transformation_ctx value, and I tested different possible behaviours (the same transformation_ctx for all of them, different ones, fixed values, dynamic values, etc.).
After many messages with AWS support, they confirmed that bookmarking works only on the read function (they also said it works only with S3 as a source, but I didn't test that). So I asked whether the transformation_ctx is useless in ResolveChoice (and in the write function too), and they said yes: they confirmed that it doesn't make any difference.
Furthermore, for the write function it doesn't change anything: there is no bookmark logic there, so a write is not skipped just because it has already run before.
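As a rough sketch of what that means for a job script, the function below passes transformation_ctx only to the read and leaves it out of ResolveChoice and the write. It relies on the support response described above, which I have not verified beyond my own tests. The function name, its parameters, the "datasource0" key, the "make_struct" choice and the parquet output are placeholders chosen for illustration.

```python
def run_incremental_job(glue_context, database, table_name, output_path):
    """Read with a bookmark key, then transform and write without one."""
    # Imported inside the function so the module also loads outside Glue.
    from awsglue.transforms import ResolveChoice

    # Source: the transformation_ctx is the key for the bookmark state.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        transformation_ctx="datasource0",
    )

    # Transform: no transformation_ctx, following the support answer above.
    resolved = ResolveChoice.apply(frame=source, choice="make_struct")

    # Sink: no transformation_ctx here either.
    glue_context.write_dynamic_frame.from_options(
        frame=resolved,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
    )
    return resolved
```

In a real job this would be called between job.init(...) and job.commit(), with job bookmarks enabled on the job; the bookmark state is only updated when job.commit() runs.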