Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

DynamicFrame vs DataFrame

Tags:

What is the difference? I know that DynamicFrame was created for AWS Glue, but AWS Glue also supports DataFrame. When should DynamicFrame be used in AWS Glue?

like image 952
Alex Oh Avatar asked Oct 15 '18 18:10

Alex Oh


People also ask

When would you use a DynamicFrame?

A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially. You want to use DynamicFrame when, Data that does not conform to a fixed schema.

What is a DynamicFrame?

A DynamicFrame is a distributed collection of self-describing DynamicRecord objects. DynamicFrame s are designed to provide a flexible data model for ETL (extract, transform, and load) operations.

What is DynamicFrame in glue?

A DynamicFrame is similar to a DataFrame , except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on-the-fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type.

Does AWS Glue use Pyspark?

AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load (ETL) jobs.


2 Answers

DynamicFrame is safer when handling memory intensive jobs. "The executor memory with AWS Glue dynamic frames never exceeds the safe threshold," while on the other hand, Spark DataFrame could hit "Out of memory" issue on executors. (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html)

DynamicFrames are designed to provide maximum flexibility when dealing with messy data that may lack a declared schema. Records are represented in a flexible self-describing way that preserves information about schema inconsistencies in the data.

For example, with changing requirements, an address column stored as a string in some records might be stored as a struct in later rows. Rather than failing or falling back to a string, DynamicFrames will track both types and gives users a number of options in how to resolve these inconsistencies, providing fine grain resolution options via the ResolveChoice transforms.

DynamicFrames also provide a number of powerful high-level ETL operations that are not found in DataFrames. For example, the Relationalize transform can be used to flatten and pivot complex nested data into tables suitable for transfer to a relational database. In additon, the ApplyMapping transform supports complex renames and casting in a declarative fashion.

DynamicFrames are also integrated with the AWS Glue Data Catalog, so creating frames from tables is a simple operation. Writing to databases can be done through connections without specifying the password. Moreover, DynamicFrames are integrated with job bookmarks, so running these scripts in the job system can allow the script to implictly keep track of what was read and written.(https://github.com/aws-samples/aws-glue-samples/blob/master/FAQ_and_How_to.md)

like image 188
Fang Zhang Avatar answered Sep 30 '22 06:09

Fang Zhang


You can refer to the documentation here: DynamicFrame Class. It says,

A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially.

You want to use DynamicFrame when,

  • Data that does not conform to a fixed schema.

Note: You can also convert the DynamicFrame to DataFrame using toDF()

  • Refer here: def toDF
like image 25
pvy4917 Avatar answered Sep 30 '22 05:09

pvy4917