Spark FileAlreadyExistsException on Stage Failure

Question

I am trying to write a DataFrame to an S3 location after repartitioning. Whenever the write stage fails and Spark retries the stage, it throws a FileAlreadyExistsException.

When I re-submit the job, it works fine as long as Spark completes the stage in one try.

Below is my code block:

df.repartition(<some-value>).write.format("orc").option("compression", "zlib").mode("Overwrite").save(path)

I believe Spark should remove the files written by the failed stage before retrying it. I understand this would be solved by setting the retry count to zero, but stage failures are expected here, so that would not be a proper solution.

Below is the error:

Job aborted due to stage failure: Task 0 in stage 6.1 failed 4 times, most recent failure: Lost task 0.3 in stage 6.1 (TID 740, ip-address, executor 170): org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://<bucket-name>/<path-to-object>/part-00000-c3c40a57-7a50-41da-9ce2-555753cab63a-c000.zlib.orc
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:601)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:242)
    at org.apache.orc.impl.PhysicalFsWriter.<init>(PhysicalFsWriter.java:95)
    at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:170)
    at org.apache.orc.OrcFile.createWriter(OrcFile.java:843)
    at org.apache.orc.mapreduce.OrcOutputFormat.getRecordWriter(OrcOutputFormat.java:50)
    at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:43)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:121)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:

I am using Spark 2.4 on EMR. Please suggest a solution.

Edit 1: Please note that the issue is not related to overwrite mode; I am already using it. As the question title suggests, the issue is with leftover files after a stage failure. Maybe the Spark UI makes this clearer.

moriarty007 · Accepted Answer

Set spark.hadoop.orc.overwrite.output.file=true in your Spark Config.

You can find more details on this config here - OrcConf.java
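
As a rough illustration of where this setting could go in a PySpark job, here is a minimal sketch. It assumes PySpark is installed when build_session() is called; the names ORC_OVERWRITE_CONF, build_session, write_orc and the app name are hypothetical and not part of the original question or answer. The stack trace above shows the exception coming from FileSystem.create as called by the ORC writer, which is the call this ORC option is meant to affect.

```python
# Sketch only: the helper names below are illustrative, not from the question.
ORC_OVERWRITE_CONF = {
    # The "spark.hadoop." prefix makes Spark copy the key into the Hadoop
    # Configuration as "orc.overwrite.output.file", which the ORC writer reads.
    "spark.hadoop.orc.overwrite.output.file": "true",
}


def build_session(app_name="orc-overwrite-example"):  # hypothetical app name
    """Create (or reuse) a SparkSession with the ORC overwrite setting applied."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in ORC_OVERWRITE_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def write_orc(df, path, num_partitions):
    """Same write as in the question, with the placeholders passed as arguments."""
    (df.repartition(num_partitions)
       .write.format("orc")
       .option("compression", "zlib")
       .mode("overwrite")
       .save(path))
```

The same key can also be passed at submit time with --conf spark.hadoop.orc.overwrite.output.file=true instead of being set in code. Note that getOrCreate() may return an already running session, in which case it is safer to supply the setting at submit time.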

Tags: python, dataframe, amazon-s3, apache-spark, pyspark

Arghya Saha
