I have a large number of events partitioned by yyyy/mm/dd/hh in S3. Every partition has about 80.000 of raw text files. Every raw file has about 1.000 Events in JSON format.
When I run the script to do my transformation:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=from_database,
                                                                table_name=from_table,
                                                                transformation_ctx="datasource0")
map0 = Map.apply(frame=datasource0, f=extract_data)
applymapping1 = ApplyMapping.apply(......)
applymapping1.toDF().write.mode('append').parquet(output_bucket, partitionBy=['year', 'month', 'day', 'hour'])
I end up with a large number of small files across partitions named like:
part-00000-a5aa817d-482c-47d0-b804-81d793d3ac88.snappy.parquet
part-00001-a5aa817d-482c-47d0-b804-81d793d3ac88.snappy.parquet
part-00002-a5aa817d-482c-47d0-b804-81d793d3ac88.snappy.parquet
Each of them is 1-3KB in size. Number corresponds roughly to the number of raw files I have.
My impression was that the Glue will take all the events from catalog, partition them the way I want and store in a single file per partition.
How do I achieve that?
You just need to set repartition(1) which will shuffle the data from all partitions to a single partition which will generate a single output file while writing. 
applymapping1.toDF()
             .repartition(1)
             .write
             .mode('append')
             .parquet(output_bucket, partitionBy=['year', 'month', 'day', 'hour'])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With