 

How to configure Spark / Glue to avoid creation of empty $_folder_$ entries after successful Glue job execution

I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty "$folder$" entries that Spark generates remain in S3. They do not look nice in the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after successful completion of the job?


[Screenshot: S3 console listing showing the generated $folder$ entries]

Asked by Lina on Jan 11 '21 at 13:01

People also ask

What is temporary directory in Glue job?

--TempDir — Specifies an Amazon S3 path to a bucket that can be used as a temporary directory for the job. For example, to set a temporary directory, pass --TempDir with an S3 path as a job argument. AWS Glue creates a temporary bucket for jobs if a bucket doesn't already exist in a region. This bucket might permit public access.
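For instance, a minimal boto3 sketch of setting --TempDir when defining a job; the job name, IAM role, and bucket paths below are made-up placeholders, not values from the question:

import boto3

glue = boto3.client("glue")

# --TempDir is passed through DefaultArguments; Glue forwards it to the job at run time
glue.create_job(
    Name="dedupe-job",  # hypothetical job name
    Role="MyGlueServiceRole",  # hypothetical IAM role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-scripts-bucket/dedupe.py"},
    DefaultArguments={"--TempDir": "s3://my-temp-bucket/glue-temp/"},
)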

What should the solutions architect do to prevent AWS Glue from reprocessing old data?

AWS Glue generates the required Python or Scala code, which you can customize as per your data transformation needs. In the Advanced properties section, choose Enable in the Job bookmark list to avoid reprocessing old data.
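If the job is started programmatically rather than from the console, the same setting is passed as the --job-bookmark-option argument; a hedged sketch with a made-up job name:

import boto3

glue = boto3.client("glue")

# "job-bookmark-enable" makes Glue track already-processed data
# so old data is not reprocessed on subsequent runs
glue.start_job_run(
    JobName="dedupe-job",  # hypothetical job name
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)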

How can I automatically start an AWS Glue job when a crawler run completes?

On the Crawlers tab, select your crawler, and then choose Add. The trigger appears on the graph. On the graph, to the right of the job trigger that you just created, choose Add node. On the Jobs tab, select the job that you want to start when the crawler run completes, and then choose Add.
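The SDK equivalent of that console flow would be a conditional trigger; a rough boto3 sketch, where the trigger, crawler, and job names are placeholders:

import boto3

glue = boto3.client("glue")

# A CONDITIONAL trigger that starts the job once the crawler run finishes successfully
glue.create_trigger(
    Name="start-job-after-crawler",  # hypothetical trigger name
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "my-crawler",  # hypothetical crawler name
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "dedupe-job"}],  # hypothetical job name
)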

Why is my AWS Glue ETL job running for a long time?

Some common reasons why your AWS Glue jobs take a long time to complete are the following: Large datasets. Non-uniform distribution of data in the datasets. Uneven distribution of tasks across the executors.
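As an illustration only (the input path and key column are invented), one way to check for skew and spread work more evenly across executors before heavy transformations:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")  # hypothetical input path

# Inspect how rows are distributed across a candidate key to spot skew
df.groupBy("customer_id").count().orderBy("count", ascending=False).show(10)

# Repartition so tasks are distributed more evenly across the executors
df = df.repartition(200, "customer_id")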


1 Answer

OK, finally after a few days of testing I found the solution. Before pasting the code, let me summarize what I have found ...

  • Those $folder$ files are created by Hadoop. Apache Hadoop creates them when it creates a folder in an S3 bucket (Source 1). They are actually directory markers, stored as path + / (Source 2).
  • To change this behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this, this, and this.
  • Read about S3, S3A, and S3N here and here.
  • Thanks to @stevel's comment here.

Now, the solution is to set the following option in the Spark context's Hadoop configuration.

from pyspark.context import SparkContext

sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
# Route s3:// URIs through the S3A filesystem, which does not write $folder$ markers
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

To avoid creation of _SUCCESS files, you need to set the following configuration as well:

hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Make sure you use the s3:// URI when writing to the S3 bucket, for example:

myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])
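Putting it all together, here is a rough sketch of how the full Glue script could look. The catalog database, table, output path, and partition column are placeholders, and the read/dedupe steps are only my assumption of the job described in the question:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
# Route s3:// paths through the S3A filesystem so no $folder$ markers are written
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
# Skip the _SUCCESS marker file as well
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawler table, drop duplicate rows, and write the result back to S3
df = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"  # hypothetical catalog names
).toDF()
df.dropDuplicates().write.mode("overwrite").parquet(
    "s3://my-bucket/output/", partitionBy=["DDD"]  # hypothetical output path
)

job.commit()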
Answered by Lina on Sep 26 '22 at 16:09