I need to upload a dataframe to an S3 bucket, but I do not have delete permissions on the bucket. Is there any way I can avoid creating this _temporary directory on S3? For example, is there a way in Spark to use the local filesystem for the _temporary directory and then upload the final resulting file to the S3 bucket, or to avoid the _temporary directory entirely?
Thanks in advance.
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.
Even when reading a file from an S3 bucket, Spark (by default) creates one partition per block, i.e. total number of partitions = total file size / block size.
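A minimal sketch of reading through S3 Select on EMR, assuming the EMR-provided s3selectCSV data source is available; the bucket, path, and filter column are hypothetical:

    // Sketch: reading CSV data via S3 Select on EMR (format name per the EMR docs;
    // the bucket/path and filter column are hypothetical examples).
    val filtered = spark.read
      .format("s3selectCSV")
      .option("header", "true")
      .load("s3://my-bucket/path/to/data/")
      .where("status = 'active'")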
No.
Data is written into _temporary/jobAttemptID/taskAttemptID/ and then renamed into the destination directory during task/job commit.
What you can do is write to HDFS for your jobs and then copy the results up to S3 using distcp. There are lots of advantages to this, not least being "with a consistent filesystem you don't run the risk of data loss you have from the s3n or s3a connectors".
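A minimal sketch of that pattern, assuming a transient HDFS staging path and an s3a:// destination, both hypothetical:

    // Sketch: stage the output on HDFS, then copy the committed files to S3
    // in a separate step. Paths below are hypothetical examples.
    df.write
      .mode("overwrite")
      .parquet("hdfs:///staging/my_dataset")

    // Then copy up outside Spark, e.g.:
    //   hadoop distcp hdfs:///staging/my_dataset s3a://my-bucket/my_dataset
    // (on EMR, s3-dist-cp is the usual equivalent)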
2019-07-11 update: the Apache Hadoop S3A committers let you commit work without the temporary folder or the rename, delivering performance and correct results even against an inconsistent S3 store. This is how you can safely commit work. Amazon EMR has its own reimplementation of this work, albeit currently without the complete failure semantics which Spark expects.
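A minimal sketch of enabling one of the S3A committers (the "directory" staging committer), assuming the spark-hadoop-cloud / hadoop-aws integration is on the classpath; check the exact property names and values against your Spark and Hadoop versions:

    // Sketch: configure Spark to use the S3A directory staging committer.
    // Assumes the spark-hadoop-cloud module and hadoop-aws are available.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3a-committer-sketch")
      .config("spark.hadoop.fs.s3a.committer.name", "directory")
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()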
Yes, you can avoid creating the _temporary directory when uploading a dataframe to S3.
When Spark appends data to an existing dataset, it uses FileOutputCommitter to manage staging output files and final output files.
By default, the output committer uses algorithm version 1. In this version, FileOutputCommitter has two methods, commitTask and commitJob. commitTask moves data generated by a task from the task temporary directory to the job temporary directory, and when all tasks complete, commitJob moves data from the job temporary directory to the final destination.
However, when the output committer uses algorithm version 2, commitTask moves data generated by a task directly to the final destination, and commitJob is basically a no-op.
How do I set spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version to 2? You can set this config using any of the following methods:

- In the Spark config (e.g. spark-defaults.conf): spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
- On the session at runtime: spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
- Per write: dataset.write.option("mapreduce.fileoutputcommitter.algorithm.version", "2")

Read more about the output committer algorithm versions in the databricks-blog and mapred-default documentation.
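A minimal end-to-end sketch, assuming an existing SparkSession named spark, a DataFrame named df, and an s3a:// destination path, all hypothetical:

    // Sketch: write a DataFrame to S3 with committer algorithm version 2.
    // The DataFrame and bucket/path are hypothetical examples.
    spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    df.write
      .mode("overwrite")
      .parquet("s3a://my-bucket/output/my_dataset")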