
Overwrite parquet files from dynamic frame in AWS Glue

I use dynamic frames to write a parquet file in S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this:

glueContext.write_dynamic_frame.from_options(frame = table,
                                         connection_type = "s3",
                                         connection_options = {"path": output_dir,
                                                               "partitionKeys": ["var1","var2"]},
                                         format = "parquet")

Is there anything like "mode": "overwrite" that replaces my parquet files?

Mateo Rod asked Aug 24 '18

People also ask

Can AWS Glue convert to Parquet?

When you create AWS Glue jobs, you can use either the IAM role that is attached or an existing role. The Python code uses the Pandas and PyArrow libraries to convert data to Parquet. The Pandas library is already available.

What is difference between dynamic frame and DataFrame?

A DynamicFrame is similar to a DataFrame , except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on-the-fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type.

Can AWS Glue convert CSV to Parquet?

Yes, we can convert CSV/JSON files to Parquet using AWS Glue, and that is not the only use case: Glue supports several other output formats as well.

What should the solutions architect do to prevent AWS Glue from reprocessing old data?

AWS Glue generates the required Python or Scala code, which you can customize as per your data transformation needs. In the Advanced properties section, choose Enable in the Job bookmark list to avoid reprocessing old data.


1 Answer

Currently AWS Glue doesn't support an 'overwrite' mode, but they are working on this feature.

As a workaround, you can convert the DynamicFrame to a Spark DataFrame and write it using Spark instead of Glue:

(table.toDF()
  .write
  .mode("overwrite")
  .format("parquet")
  .partitionBy("var1", "var2")
  .save(output_dir))
Yuriy Bondaruk answered Sep 18 '22