I use dynamic frames to write a Parquet file to S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this:
glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={"path": output_dir,
                        "partitionKeys": ["var1", "var2"]},
    format="parquet")
Is there anything like "mode": "overwrite" that replaces my Parquet files?
When you create AWS Glue jobs, you can use either the IAM role that is attached or an existing role. The Python code uses the Pandas and PyArrow libraries to convert data to Parquet; the Pandas library is already available.
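For a sense of what that conversion looks like outside of Spark, here is a minimal sketch using Pandas with the PyArrow engine; both file paths are hypothetical:

import pandas as pd

# Read a CSV file and rewrite it as Parquet via the PyArrow engine.
# "input.csv" and "output.parquet" are placeholder paths.
df = pd.read_csv("input.csv")
df.to_parquet("output.parquet", engine="pyarrow", index=False)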
A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on the fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type.
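As an illustration, a DynamicFrame can be built from the Glue Data Catalog and its inferred schema inspected; the database, table, and column names below are hypothetical:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Glue infers the schema when the DynamicFrame is read, not up front.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table")

dyf.printSchema()  # columns with inconsistent types appear as a choice type

# Resolve a choice type explicitly, e.g. cast an ambiguous column to long.
resolved = dyf.resolveChoice(specs=[("id", "cast:long")])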
Yes, you can convert CSV/JSON files to Parquet using AWS Glue, but that is not the only conversion supported: Glue can also write other formats such as ORC, Avro, JSON, and CSV. A sketch of the CSV-to-Parquet case follows.
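The conversion uses the same from_options API shown in the question, reading CSV and writing Parquet; both S3 paths here are hypothetical:

# Read CSV from S3 into a DynamicFrame, then write it back out as Parquet.
csv_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw/"]},
    format="csv",
    format_options={"withHeader": True})

glueContext.write_dynamic_frame.from_options(
    frame=csv_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet/"},
    format="parquet")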
AWS Glue generates the required Python or Scala code, which you can customize to fit your data transformation needs. In the Advanced properties section, choose Enable in the Job bookmark list to avoid reprocessing old data.
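Note that a bookmark only takes effect if the script initializes and commits a Job object; a minimal skeleton (the argument handling is standard Glue boilerplate) looks roughly like this:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())

# init() and commit() bracket the run so the bookmark can record progress.
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... ETL reads and writes go here ...

job.commit()  # persists bookmark state for the next run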
Currently AWS Glue doesn't support an 'overwrite' mode, but they are working on this feature.
As a workaround, you can convert the DynamicFrame to a Spark DataFrame and write it with Spark instead of Glue:
(table.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("var1", "var2")
    .save(output_dir))
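Be aware that mode("overwrite") replaces everything under output_dir, not just the partitions being written in the current run. If you only want to refresh those partitions, Spark 2.3+ offers a dynamic partition overwrite setting; this is a Spark option, not a Glue one, and in a Glue script the session is reachable via glueContext.spark_session:

# Overwrite only the partitions contained in this write; leave others intact.
spark = glueContext.spark_session
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")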