I have a simple Spark program running on an EMR cluster that converts a 60 GB CSV file into Parquet. When I submit the job I get the exception below.
391, ip-172-31-36-116.us-west-2.compute.internal, executor 96): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: D13A3F4D7DD970FA; S3 Extended Request ID: gj3cPalkkOwtaf9XN/P+sb3jX0CNHu/QF9WTabkgP2ISuXcXdbvYO1Irg0O54OCvKlLz8WoR8E4=), S3 Extended Request ID: gj3cPalkkOwtaf9XN/P+sb3jX0CNHu/QF9WTabkgP2ISuXcXdbvYO1Irg0O54OCvKlLz8WoR8E4=
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639)
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.
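For reference, here is a minimal sketch of what that looks like from Spark on EMR (the bucket and path are placeholders, and the s3selectCSV data source only exists on EMR 5.17.0 and later):

// S3 Select pushes filtering down to S3, so only the needed subset of the object is transferred
val selected = spark.read
  .format("s3selectCSV")                        // EMR-specific data source
  .option("header", "true")
  .load("s3://your-bucket/path/to/data.csv")    // placeholder path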
503 SlowDown errors can also be caused by two or more simultaneous PUT calls on the same object key due to Amazon S3 Strong Consistency. When a PUT request is made in Amazon S3, Amazon S3 automatically stores your object redundantly across multiple Availability Zones.
s3 is a block-based overlay on top of Amazon S3, whereas s3n and s3a are object-based. s3n supports objects up to 5 GB where size is the concern, while s3a supports objects up to 5 TB and has higher performance. Note that s3a is the successor to s3n.
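On a plain Apache Spark/Hadoop setup (EMR ships its own EMRFS connector behind s3://), the connector is chosen simply by the URI scheme you pass to Spark; a small sketch with a placeholder bucket:

// s3a:// selects the S3A connector from the hadoop-aws module
val df = spark.read
  .option("header", "true")
  .csv("s3a://your-bucket/input/data.csv")      // placeholder path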
Amazon S3 Transfer Acceleration is a bucket-level feature that enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. Transfer Acceleration is designed to optimize transfer speeds from across the world into S3 buckets.
The default job committer for Spark (called FileOutputCommitter) is not safe to use with S3, because it commits output by renaming, and S3 has no atomic rename. For example, if a failure occurs while the renaming operation is in progress, the data output can be corrupted. In addition to being unsafe, it can also be very slow.
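On EMR, one mitigation is the EMRFS S3-optimized committer; a sketch assuming EMR 5.19.0 or later, where that committer and this property exist (on 5.20.0 and later it is enabled by default):

import org.apache.spark.sql.SparkSession

// enable the EMRFS S3-optimized committer for Parquet writes, avoiding rename-based commits
val spark = SparkSession.builder
  .appName("csv-to-parquet")
  .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
  .getOrCreate()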
To maximize your security, you should not use any of the Authentication properties that require you to write secret keys to a properties file. The simplest way to confirm that your Spark cluster is handling S3 protocols correctly is to point a Spark interactive shell at the cluster and run a simple chain of operators.
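For example, a quick smoke test from spark-shell on the cluster (the bucket and object below are placeholders):

// read an existing S3 object and count its lines; a failure here points at connector or credential problems
val lines = spark.read.textFile("s3a://your-bucket/some-existing-file.csv")
println(lines.count())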
Coordinating the versions of the various required libraries is the most difficult part; writing application code for S3 is very straightforward. You need a working Spark cluster, as described in Managing a Spark Cluster with the spark-ec2 Script.
There are no S3 libraries in the core Apache Spark project. Spark uses libraries from Hadoop to connect to S3, and the integration between Spark, Hadoop, and the AWS services can feel a little finicky. We skip over the two older protocols (s3 and s3n) for this recipe and use s3a.
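As a sketch of the dependency side, assuming an sbt build (the version shown is illustrative; hadoop-aws must match the Hadoop version on your cluster):

// build.sbt -- hadoop-aws brings in the S3A connector and a matching AWS SDK
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.7"  // match your cluster's Hadoop version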
503 Slow Down is a generic response from AWS services when you're making too many requests per second.
Possible solutions:

// write fewer, larger output files so fewer tasks are writing to S3 at the same time
df.repartition(100)

// or cut overall parallelism, e.g. by running with a single local thread
val spark = SparkSession.builder.appName("Simple Application").master("local[1]").getOrCreate()
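Putting the repartition idea back into the original CSV-to-Parquet job, a sketch with placeholder paths:

import org.apache.spark.sql.SparkSession

// fewer output partitions means fewer simultaneous S3 writes from the final stage
val spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
val df = spark.read.option("header", "true").csv("s3://your-bucket/input/")
df.repartition(100)
  .write
  .mode("overwrite")
  .parquet("s3://your-bucket/output/")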
I'm surprised that things failed; the Apache s3a client backs off when it sees a problem like this: your work is done, just more slowly.
All of Sergey's advice is good. I'd start by coalescing small files and reducing workers: a smaller cluster can deliver more performance, and save money.
One more: if you are using SSE-KMS to encrypt the data, accessing that key can trigger throttle events too; that throttling is shared across all applications trying to use the KMS store.
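For context, this is roughly where SSE-KMS shows up in an S3A setup (a sketch; the exact property names vary slightly between Hadoop versions, and the key ARN is a placeholder), so you can check whether your job is going through KMS at all:

// route the Hadoop S3A encryption settings through Spark's spark.hadoop.* prefix
val spark = SparkSession.builder
  .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
  .config("spark.hadoop.fs.s3a.server-side-encryption.key", "arn:aws:kms:...:key/your-key-id")  // placeholder ARN
  .getOrCreate()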