 

AWS Glue write parquet with partitions

I am able to write to parquet format and partitioned by a column like so:

jobname = args['JOB_NAME']
# header is a Spark DataFrame
header.repartition(1).write.parquet('s3://bucket/aws-glue/{}/header/'.format(jobname), 'append', partitionBy='date')

But I am not able to do this with Glue's DynamicFrame.

header_tmp = DynamicFrame.fromDF(header, glueContext, "header")
glueContext.write_dynamic_frame.from_options(
    frame = header_tmp,
    connection_type = "s3",
    connection_options = {"path": 's3://bucket/output/header/'},
    format = "parquet")

I have tried passing partitionBy as part of the connection_options dict, since the AWS docs say that Glue does not support any format options for parquet, but that didn't work.
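A reconstruction of that attempt (note that "partitionBy" was my guess at a key name; it is not a documented connection_options key, which is presumably why it had no effect):

glueContext.write_dynamic_frame.from_options(
    frame = header_tmp,
    connection_type = "s3",
    # "partitionBy" is not a recognized connection option, so it is silently ignored
    connection_options = {"path": 's3://bucket/output/header/', "partitionBy": "date"},
    format = "parquet")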

Is this possible, and if so, how? As for my reason for doing it this way: I thought it was needed for job bookmarking to work, as that is not currently working for me.

asked Mar 06 '18 by stewart99


2 Answers

I use some of the columns from my dataframe as the partitionKeys object:

glueContext.write_dynamic_frame \
    .from_options(
        frame = some_dynamic_dataframe,
        connection_type = "s3",
        connection_options = {"path": "some_path", "partitionKeys": ["month", "day"]},
        format = "parquet")
answered Oct 16 '22 by Dan K


From AWS Support (paraphrasing a bit):

As of today, Glue does not support the partitionBy parameter when writing to parquet. This is in the pipeline to be worked on, though.

Using the Glue API to write to parquet is required for the job bookmarking feature to work with S3 sources.

So as of today it is not possible to partition parquet files AND enable the job bookmarking feature.

Edit: today (3/23/18) I found this in the documentation:

glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_type = "s3",
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")

That option may have always been there and both the AWS support person and I missed it, or it may have been added only recently. Either way, it seems to be possible now.
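Combined with the bookmarking motivation from the original question, a minimal job skeleton might look like the following (a sketch, not a verified script; bookmarks also have to be enabled in the job's properties, and the transformation_ctx name is my own):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # bookmark state is keyed by the job name

# header is the Spark DataFrame from the question, built from a bookmarked source
header_tmp = DynamicFrame.fromDF(header, glueContext, "header")
glueContext.write_dynamic_frame.from_options(
    frame = header_tmp,
    connection_type = "s3",
    connection_options = {"path": 's3://bucket/output/header/', "partitionKeys": ["date"]},
    format = "parquet",
    transformation_ctx = "write_header")

job.commit()  # without this the bookmark is never advanced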

answered Oct 16 '22 by stewart99