What is transformation_ctx used for in AWS Glue?

Many methods in the API accept this parameter, with a default value of "".

Is it just a string marker? If so, what is its purpose?

Cherry asked Jan 17 '18 12:01



2 Answers

Many of the AWS Glue PySpark dynamic frame methods include an optional parameter named transformation_ctx, which is used to identify state information for a job bookmark. If you do not pass in the transformation_ctx parameter, then job bookmarks are not enabled for a dynamic frame or table used in the method.

https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
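For illustration, here is a minimal sketch of a Glue PySpark job that passes transformation_ctx at each step. The database, table, column names, and S3 path are made-up placeholders, and wrapping everything in a `run_job` function is just my own convention here, not something Glue requires. It also assumes job bookmarks are enabled on the job itself (the `--job-bookmark-option job-bookmark-enable` job argument), and that the script is run inside Glue, where the awsglue library is available.

```python
import sys


def run_job(argv):
    # Imported inside the function because awsglue and pyspark are only
    # available in the Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    # job.init / job.commit bracket the run; bookmark state is saved on commit.
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Each call gets its own transformation_ctx string.
    # "example_db" and "example_table" are hypothetical names.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="example_db",
        table_name="example_table",
        transformation_ctx="source",
    )

    # Hypothetical columns: rename "id" to "order_id", keep "amount" as is.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("id", "string", "order_id", "string"),
            ("amount", "double", "amount", "double"),
        ],
        transformation_ctx="mapped",
    )

    # Hypothetical output location.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/output/"},
        format="parquet",
        transformation_ctx="sink",
    )

    job.commit()
```

In an actual Glue script you would call `run_job(sys.argv)` at the bottom of the file. If the transformation_ctx on the `from_catalog` call is left at its default "", that source is read without bookmark tracking, as the quoted docs describe.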

이재승 answered Oct 20 '22 01:10


I think this is what is going on. I wish the AWS docs would explicitly state it.

Bookmarks alone would only let you pick up at the next piece of data (e.g. the next file in S3). But for a complex job with Dynamic Frames, the job itself is stateful. To resume processing, you need not only to pick up with the next piece of input, but also to restore the state you had built up within your Dynamic Frames during the last run. The transformation_ctx is like a filename for saving the Dynamic Frame state. You have to name it, because AWS Glue isn't going to analyze your script to figure out which Dynamic Frame invocation is which.

Inferred primarily from Tracking Processed Data Using Job Bookmarks, which is the same page that other answers linked, but has somewhat clarified text since they quoted it:

Many of the AWS Glue PySpark dynamic frame methods include an optional parameter named transformation_ctx, which is a unique identifier for the ETL operator instance. The transformation_ctx parameter is used to identify state information within a job bookmark for the given operator. Specifically, AWS Glue uses transformation_ctx to index the key to the bookmark state.
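To make the "index the key to the bookmark state" idea concrete, here is a toy model in plain Python. It is only a mental model: `BookmarkStore` and its methods are invented for this sketch and are not part of Glue, and Glue's real bookmark state is internal and more involved. The point is just that state is looked up by the transformation_ctx string, and an empty string means nothing is tracked.

```python
class BookmarkStore:
    """Toy stand-in for a job bookmark: state keyed by transformation_ctx."""

    def __init__(self):
        # Maps transformation_ctx -> set of inputs already processed.
        self._state = {}

    def unprocessed(self, transformation_ctx, available_inputs):
        """Return the inputs not yet recorded for this transformation_ctx."""
        if not transformation_ctx:
            # No ctx means no key to look up, so everything is re-read.
            return list(available_inputs)
        seen = self._state.get(transformation_ctx, set())
        return [item for item in available_inputs if item not in seen]

    def commit(self, transformation_ctx, processed_inputs):
        """Record inputs as processed under this transformation_ctx."""
        if transformation_ctx:
            self._state.setdefault(transformation_ctx, set()).update(processed_inputs)


store = BookmarkStore()

# First run: two files exist, both are new for the "orders_source" operator.
files_run1 = ["a.json", "b.json"]
first = store.unprocessed("orders_source", files_run1)
store.commit("orders_source", first)

# Second run: one more file has arrived; only that one is returned.
files_run2 = ["a.json", "b.json", "c.json"]
second = store.unprocessed("orders_source", files_run2)

# Same inputs with the default "" ctx: nothing is tracked, so all come back.
untracked = store.unprocessed("", files_run2)
```

This is also why each operator wants its own name: in this model, two operators sharing one transformation_ctx would read and write the same entry.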

Lorrin answered Oct 20 '22 01:10