I have a simple Glue ETL job that is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back to an S3 bucket. The job completes successfully. However, the empty "$folder$" objects that Spark generates remain in S3. They clutter the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after the job completes successfully?
[Screenshot: S3 console showing the "$folder$" objects]
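The job is essentially the following; a minimal sketch, with the database, table, and bucket names as placeholders:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the crawler table, drop duplicate rows, write the result back to S3
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")
deduped = dyf.toDF().dropDuplicates()
deduped.write.mode("overwrite").parquet("s3://my-bucket/deduped/")

job.commit()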
OK, finally after a few days of testing I found the solution. Before pasting the code, let me summarize what I found ...
The solution is to set the following Hadoop configuration on the Spark context:
from pyspark.context import SparkContext

sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
# Use the S3A filesystem for s3:// paths so the "$folder$" markers are not created
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
To avoid the creation of _SUCCESS files, set the following configuration as well:
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Make sure you use the s3:// URI scheme when writing to the S3 bucket, for example:
myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])
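In a Glue job script, the same settings go on the SparkContext that the GlueContext wraps, before anything is written. A minimal sketch, with the bucket, path, and partition column kept as placeholders and a toy DataFrame standing in for the deduplicated data:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
# Write through S3A so no "$folder$" markers are left behind, and skip the _SUCCESS files
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

glueContext = GlueContext(sc)
spark = glueContext.spark_session

# myDF stands for the deduplicated DataFrame produced earlier in the job
myDF = spark.createDataFrame([(1, "a"), (2, "a"), (2, "a")], ["id", "DDD"]).dropDuplicates()
myDF.write.mode("overwrite").partitionBy("DDD").parquet("s3://XXX/YY")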