I am working with Spark to write data into S3 using the S3A URI.
I am also using the s3-external-1.amazonaws.com endpoint to avoid the read-after-write eventual-consistency issue in us-east-1.
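For context, I set the endpoint through the Hadoop configuration that Spark uses; a minimal sketch of that setup (the app name is just a placeholder) looks like this:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: point the S3A connector at the read-after-write consistent endpoint
val sc = new SparkContext(new SparkConf().setAppName("s3a-endpoint-example"))
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-external-1.amazonaws.com")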
The following problem happens when trying to write some data to S3 (it's actually a move operation):
com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code: 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error Message: One or more objects could not be deleted, S3 Extended Request ID: null
at com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:1745)
at org.apache.hadoop.fs.s3a.S3AFileSystem.delete(S3AFileSystem.java:687)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:381)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:314)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at org.apache.spark.sql.DataFrameWriter.orc(DataFrameWriter.scala:346)
at com.mgmg.memengine.stream.app.persistentEventStreamBootstrap$$anonfun$setupSsc$3.apply(persistentEventStreamBootstrap.scala:122)
at com.mgmg.memengine.stream.app.persistentEventStreamBootstrap$$anonfun$setupSsc$3.apply(persistentEventStreamBootstrap.scala:112)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
Object(s):
This can also be caused by race conditions in which >1 process tries to delete the path; HADOOP-14101 shows that.
In that specific case, you should be able to make the stack trace go away by setting the Hadoop option fs.s3a.multiobjectdelete.enable to false.
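If you do disable it, a minimal way to set that option from Spark is through the Hadoop configuration (or the equivalent spark.hadoop.* property at submit time); the property name is the real Hadoop option, the rest is just a sketch:

// Disable bulk deletes so S3A falls back to one DELETE request per object
sc.hadoopConfiguration.set("fs.s3a.multiobjectdelete.enable", "false")

// or, at submit time:
// spark-submit --conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false ...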
Update, 2017-02-23
Having written some tests for this, I've not been able to replicate it for deletion of nonexistent paths, but I have done it for permission problems. Assume that's the cause for now, though we would welcome more stack traces to help identify the problem. HADOOP-11572 covers the issue, including patches, documentation, and better logging of the problem (i.e., logging the failed paths and the specific errors).
I ran into this problem when I upgraded to Spark 2.0.0, and it turned out to be a missing S3 permission. I'm currently running Spark 2.0.0 using aws-java-sdk-1.7.4 and hadoop-aws-2.7.2 as dependencies.
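For reference, those dependencies look roughly like this in an sbt build (assuming sbt; adjust the coordinates if you use Maven):

libraryDependencies ++= Seq(
  "com.amazonaws"     % "aws-java-sdk" % "1.7.4",
  "org.apache.hadoop" % "hadoop-aws"   % "2.7.2"
)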
To fix the issue, I had to add the s3:Delete* action to the corresponding IAM policy. Depending on how your environment is set up, this might be a bucket policy on the S3 bucket, a policy for the IAM user whose SECRET_KEY your Hadoop s3a library is connecting with, or an IAM role policy for the EC2 instance where Spark is running.
In my case, my working IAM role policy now looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Delete*",
        "s3:Get*",
        "s3:List*",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::mybucketname/*"
    }
  ]
}
This is a quick change through either the S3 or IAM AWS consoles and should apply immediately without requiring a Spark cluster restart. If you're not sure how to edit policies, I've provided a bit more detail here.
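For example, if you save the policy above to a file, updating an EC2 instance's role from the AWS CLI can look roughly like this (the role and policy names are placeholders for your own):

aws iam put-role-policy \
    --role-name my-spark-ec2-role \
    --policy-name spark-s3-access \
    --policy-document file://spark-s3-policy.json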