When running a python job in AWS Glue I get the error:
Reason: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
When running this at the beginning of the script:
print('--- Before Conf --')
print('spark.yarn.driver.memory', sc._conf.get('spark.yarn.driver.memory'))
print('spark.yarn.driver.cores', sc._conf.get('spark.yarn.driver.cores'))
print('spark.yarn.executor.memory', sc._conf.get('spark.yarn.executor.memory'))
print('spark.yarn.executor.cores', sc._conf.get('spark.yarn.executor.cores'))
print('spark.yarn.executor.memoryOverhead', sc._conf.get('spark.yarn.executor.memoryOverhead'))

print('--- Conf --')
sc._conf.setAll([
    ('spark.yarn.executor.memory', '15G'),
    ('spark.yarn.executor.memoryOverhead', '10G'),
    ('spark.yarn.driver.cores', '5'),
    ('spark.yarn.executor.cores', '5'),
    ('spark.yarn.cores.max', '5'),
    ('spark.yarn.driver.memory', '15G'),
])

print('--- After Conf ---')
print('spark.yarn.driver.memory', sc._conf.get('spark.yarn.driver.memory'))
print('spark.yarn.driver.cores', sc._conf.get('spark.yarn.driver.cores'))
print('spark.yarn.executor.memory', sc._conf.get('spark.yarn.executor.memory'))
print('spark.yarn.executor.cores', sc._conf.get('spark.yarn.executor.cores'))
print('spark.yarn.executor.memoryOverhead', sc._conf.get('spark.yarn.executor.memoryOverhead'))
I get the following output:
--- Before Conf --
spark.yarn.driver.memory None
spark.yarn.driver.cores None
spark.yarn.executor.memory None
spark.yarn.executor.cores None
spark.yarn.executor.memoryOverhead None
--- Conf --
--- After Conf ---
spark.yarn.driver.memory 15G
spark.yarn.driver.cores 5
spark.yarn.executor.memory 15G
spark.yarn.executor.cores 5
spark.yarn.executor.memoryOverhead 10G
It seems like spark.yarn.executor.memoryOverhead is set, so why isn't it recognized? I still get the same error.
I have seen other posts about problems with setting spark.yarn.executor.memoryOverhead, but none where it appears to be set and still has no effect.
Use the --conf option to increase memory overhead when you run spark-submit. If increasing the memory overhead doesn't solve the problem, then reduce the number of executor cores.
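Outside Glue, the equivalent of passing --conf to spark-submit is to put these properties on a SparkConf before the SparkContext is created; changing sc._conf after the context exists (as in the script above) does not affect the containers YARN has already sized. A minimal sketch, assuming a standalone PySpark script and example values:

from pyspark import SparkConf, SparkContext

# These properties are read when the application starts, so they must be set
# before the SparkContext exists (or passed on the command line via --conf).
conf = (SparkConf()
        .set('spark.yarn.executor.memoryOverhead', '2g')  # example value
        .set('spark.executor.cores', '4'))                # fewer cores per executor

sc = SparkContext(conf=conf)
print(sc._conf.get('spark.yarn.executor.memoryOverhead'))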
For example, on a cluster with 150 usable cores spread over 10 nodes and 64 GB of memory per node, with 5 cores per executor: number of available executors = (total cores / cores per executor) = 150/5 = 30. Leaving 1 executor for the ApplicationMaster gives --num-executors = 29. Executors per node = 30/10 = 3. Memory per executor = 64 GB / 3 = 21 GB.
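The same sizing as a quick sanity check in Python; the cluster figures (150 usable cores, 10 nodes, 64 GB per node) are the ones implied by the numbers above:

# Executor sizing using the example figures above.
total_cores = 150            # usable cores across the cluster
cores_per_executor = 5
nodes = 10
memory_per_node_gb = 64

available_executors = total_cores // cores_per_executor            # 30
num_executors = available_executors - 1                            # 29, one for the ApplicationMaster
executors_per_node = available_executors // nodes                  # 3
memory_per_executor_gb = memory_per_node_gb // executors_per_node  # 21 GB

print(num_executors, executors_per_node, memory_per_executor_gb)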
For vertical scaling, you can also use Glue's G.1X and G.2X worker types, which provide more memory and disk space, to scale up Glue jobs that need a lot of memory or disk space to store intermediate shuffle output.
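If the job is managed through the API rather than the console, the worker type can be set when creating or updating the job. A hedged sketch using boto3; the job name, role, and script location below are placeholders:

import boto3

glue = boto3.client('glue')

# Switch to the larger G.2X workers to get more memory and disk per executor.
# Job name, role, and script path are placeholders for illustration only.
glue.update_job(
    JobName='my-glue-job',
    JobUpdate={
        'Role': 'MyGlueServiceRole',
        'Command': {'Name': 'glueetl', 'ScriptLocation': 's3://my-bucket/scripts/job.py'},
        'WorkerType': 'G.2X',
        'NumberOfWorkers': 10,
    },
)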
Open Glue > Jobs > edit your job > Script libraries and job parameters (optional) > Job parameters (near the bottom).
Set the following: key --conf, value spark.yarn.executor.memoryOverhead=1024.
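The same key/value pair can also be supplied when starting a run programmatically. A sketch using boto3 with a placeholder job name; note that --conf is normally reserved for Glue's own use, so treat this as a workaround rather than a documented feature:

import boto3

glue = boto3.client('glue')

# Placeholder job name; the argument mirrors the console setting above.
glue.start_job_run(
    JobName='my-glue-job',
    Arguments={'--conf': 'spark.yarn.executor.memoryOverhead=1024'},
)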
Unfortunately, the current version of Glue doesn't support this functionality. You cannot set parameters other than through the UI. In your case, instead of AWS Glue, you could use the AWS EMR service.
When I had a similar problem, I reduced the number of shuffles and the amount of data shuffled, and increased the DPU count. While working on this I relied on the articles below; I hope they are useful. A sketch of the kind of change that helped follows the links.
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
https://www.indix.com/blog/engineering/lessons-from-using-spark-to-process-large-amounts-of-data-part-i/
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/sparksqlshufflepartitions_draft.html
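As mentioned above, a sketch of the kind of change that reduces shuffle pressure: it trims columns and rows before a join so less data is shuffled, and raises spark.sql.shuffle.partitions so each shuffle task holds less data in memory. The table names and the partition count are hypothetical and need tuning to your data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# More, smaller shuffle partitions -> each task holds less data in memory.
# 400 is only an example value.
spark.conf.set('spark.sql.shuffle.partitions', '400')

# Hypothetical tables: keep only the columns and rows you need *before* the
# join so less data is shuffled across the cluster.
orders = spark.table('orders').select('order_id', 'customer_id', 'amount')
customers = (spark.table('customers')
             .select('customer_id', 'country')
             .filter("country = 'US'"))

joined = orders.join(customers, 'customer_id')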
Updated: 2019-01-13
Amazon recently added a new section to the AWS Glue documentation that describes how to monitor and optimize Glue jobs. I think it is very useful for understanding where a memory-related problem comes from and how to avoid it.
https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-glue-job-cloudwatch-metrics.html