 

How to configure Spark / Glue to avoid creation of empty $_folder_$ entries after successful Glue job execution

I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty "$folder$" entries that Spark generates remain in S3. They do not look nice in the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after successful completion of the job?


[Screenshot: S3 console listing showing the generated $folder$ entries]

Asked by Lina on Jan 11 '21 at 13:01

People also ask

What is temporary directory in Glue job?

--TempDir — Specifies an Amazon S3 path to a bucket that can be used as a temporary directory for the job. For example, to set a temporary directory, pass --TempDir with an S3 path as a job argument. AWS Glue creates a temporary bucket for jobs if a bucket doesn't already exist in a region. This bucket might permit public access.
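For instance, a minimal boto3 sketch of setting --TempDir when defining a job; the job name, IAM role, and bucket paths below are made-up placeholders, not values from the question:

import boto3

glue = boto3.client("glue")

# --TempDir is passed through DefaultArguments; Glue forwards it to the job at run time
glue.create_job(
    Name="dedupe-job",  # hypothetical job name
    Role="MyGlueServiceRole",  # hypothetical IAM role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-scripts-bucket/dedupe.py"},
    DefaultArguments={"--TempDir": "s3://my-temp-bucket/glue-temp/"},
)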

What should the solutions architect do to prevent AWS Glue from reprocessing old data?

AWS Glue generates the required Python or Scala code, which you can customize as per your data transformation needs. In the Advanced properties section, choose Enable in the Job bookmark list to avoid reprocessing old data.
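If the job is started programmatically rather than from the console, the same setting is passed as the --job-bookmark-option argument; a hedged sketch with a made-up job name:

import boto3

glue = boto3.client("glue")

# "job-bookmark-enable" makes Glue track already-processed data
# so old data is not reprocessed on subsequent runs
glue.start_job_run(
    JobName="dedupe-job",  # hypothetical job name
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)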

How can I automatically start an AWS Glue job when a crawler run completes?

On the Crawlers tab, select your crawler, and then choose Add. The trigger appears on the graph. On the graph, to the right of the job trigger that you just created, choose Add node. On the Jobs tab, select the job that you want to start when the crawler run completes, and then choose Add.
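The SDK equivalent of that console flow would be a conditional trigger; a rough boto3 sketch, where the trigger, crawler, and job names are placeholders:

import boto3

glue = boto3.client("glue")

# A CONDITIONAL trigger that starts the job once the crawler run finishes successfully
glue.create_trigger(
    Name="start-job-after-crawler",  # hypothetical trigger name
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "my-crawler",  # hypothetical crawler name
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "dedupe-job"}],  # hypothetical job name
)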

Why is my AWS Glue ETL job running for a long time?

Some common reasons why your AWS Glue jobs take a long time to complete are the following: Large datasets. Non-uniform distribution of data in the datasets. Uneven distribution of tasks across the executors.
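As an illustration only (the input path and key column are invented), one way to check for skew and spread work more evenly across executors before heavy transformations:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")  # hypothetical input path

# Inspect how rows are distributed across a candidate key to spot skew
df.groupBy("customer_id").count().orderBy("count", ascending=False).show(10)

# Repartition so tasks are distributed more evenly across the executors
df = df.repartition(200, "customer_id")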


1 Answer

OK, finally after a few days of testing I found the solution. Before pasting the code, let me summarize what I have found ...

  • Those $folder$ files are created by Hadoop. Apache Hadoop creates them when it creates a folder in an S3 bucket (Source 1). They are actually directory markers, stored as path + / (Source 2).
  • To change this behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this, this, and this.
  • Read about S3, S3A, and S3N here and here.
  • Thanks to @stevel's comment here.

Now, the solution is to set the following option in the Spark context's Hadoop configuration.

from pyspark.context import SparkContext

sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
# Route s3:// URIs through the S3A filesystem, which does not write $folder$ markers
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

To avoid creation of _SUCCESS files, you need to set the following configuration as well:

hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Make sure you use the s3:// URI when writing to the S3 bucket, for example:

myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])
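Putting it all together, here is a rough sketch of how the full Glue script could look. The catalog database, table, output path, and partition column are placeholders, and the read/dedupe steps are only my assumption of the job described in the question:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
# Route s3:// paths through the S3A filesystem so no $folder$ markers are written
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
# Skip the _SUCCESS marker file as well
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawler table, drop duplicate rows, and write the result back to S3
df = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"  # hypothetical catalog names
).toDF()
df.dropDuplicates().write.mode("overwrite").parquet(
    "s3://my-bucket/output/", partitionBy=["DDD"]  # hypothetical output path
)

job.commit()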
Answered by Lina on Sep 26 '22 at 16:09