
Pyspark: java.lang.OutOfMemoryError: GC overhead limit exceeded

I'm relatively new to PySpark. I have been trying to cache a 30 GB dataset because I need to perform clustering on it. When I performed any action, such as count, I initially got a heap-space error. After some searching I found that increasing the executor/driver memory should fix it, so here is my current configuration:

from pyspark import SparkConf

conf = (SparkConf()
        .set('spark.executor.memory', '45G')
        .set('spark.driver.memory', '80G')
        .set('spark.driver.maxResultSize', '10G'))

But now I'm getting this garbage collection error. I checked SO, but the answers everywhere are quite vague: people suggest playing with the configuration. Is there a better way to figure out what the configuration should be? I know that this is just a debug exception and I can turn it off, but I would still like to learn a bit of the maths for calculating the configuration on my own.

I'm currently on a server with 256GB RAM. Any help is appreciated. Thanks in advance.

lu5er asked Aug 29 '18

People also ask

How do I increase my Pyspark memory?

To enlarge the Spark shuffle service memory size, modify SPARK_DAEMON_MEMORY in $SPARK_HOME/conf/spark-env.sh (the default value is 2g), then restart the shuffle service for the change to take effect.
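For the more common case of raising executor and driver memory for a PySpark application (rather than the shuffle service), a minimal sketch; the memory values are illustrative assumptions, not recommendations:

```python
from pyspark.sql import SparkSession

# Memory values below are placeholder assumptions; size them to your cluster.
spark = (SparkSession.builder
         .appName("memory-tuning-example")
         .config("spark.executor.memory", "8g")
         .config("spark.driver.memory", "8g")
         .getOrCreate())

# Confirm the setting actually took effect.
print(spark.sparkContext.getConf().get("spark.executor.memory"))
```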

What is GC overhead in spark?

The GC Overhead Limit Exceeded error is an indication of resource exhaustion, i.e. memory. The JVM throws this error if the Java process spends more than 98% of its time doing GC while recovering less than 2% of the heap in each collection.
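Beyond adding memory, one common mitigation is switching the executors and driver to the G1 garbage collector via extra JVM options; a hedged sketch, with no guarantee it helps for a given workload:

```python
from pyspark import SparkConf

# -XX:+UseG1GC is a standard JVM flag; whether it helps depends on your workload.
conf = (SparkConf()
        .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
        .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC"))
```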

What is spark driver maxResultSize?

spark.driver.maxResultSize sets a limit on the total size of serialized results of all partitions for each Spark action (such as collect). Jobs will fail if the size of the results exceeds this limit; however, a high limit can cause out-of-memory errors in the driver.
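As a concrete illustration of where this limit bites, collecting a large DataFrame back to the driver is bounded by it; a sketch, with the 10g value being an assumed placeholder:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "10g")  # assumed value
         .getOrCreate())

df = spark.range(10_000_000)
rows = df.collect()  # raises an error if the serialized results exceed maxResultSize
```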


1 Answer

How many cores does your server/cluster have?

What this GC error is saying is that Spark has spent at least 98% of the run time garbage collecting (cleaning up unused objects from memory) but has managed to free less than 2% of the heap while doing so. I don't think it's avoidable, as you suggest, because it indicates that memory is almost full and garbage collection is needed; suppressing the message would likely just lead to an out-of-memory error shortly afterwards. This link will give you the details about what this error means, as will this resource.

Solving it can be as simple as playing with config settings, as you mentioned, but it can also mean you need code fixes. Reducing how many temporary objects are stored, making your dataframe as compact as it can be (encoding strings as indices, for example), and performing joins or other operations at the most memory-efficient point in the pipeline can all help. Look into broadcasting smaller dataframes for joins; see the sketch below. It's tough to suggest anything more specific without seeing your code.
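To make the last two suggestions concrete, a minimal sketch of a broadcast join and of encoding a string column as a numeric index; the dataframe names, paths, and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: `big_df` is the 30 GB dataset, `small_df` a small lookup table.
big_df = spark.read.parquet("/path/to/big")      # placeholder path
small_df = spark.read.parquet("/path/to/small")  # placeholder path

# Broadcast join: ships the small table to every executor instead of shuffling the big one.
joined = big_df.join(broadcast(small_df), on="id")

# Encode a string column as a numeric index to shrink the cached footprint.
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
joined = indexer.fit(joined).transform(joined)
```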

For your Spark config tuning, this link should provide all the info you need. Your config settings seem very high at first glance, but I don't know your cluster setup.
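Since you asked about the maths, here is a rough worked example of the usual sizing heuristic; the 32-core count is an assumption purely for illustration, as is the amount reserved for the OS and off-heap overhead:

```python
# Assumed hardware: 256 GB RAM and 32 cores on a single node (core count is a guess).
total_ram_gb = 256
total_cores = 32

os_reserve_gb = 16          # leave some RAM for the OS and other daemons
cores_per_executor = 5      # common rule of thumb for good I/O throughput
num_executors = total_cores // cores_per_executor                       # -> 6
raw_mem_per_executor = (total_ram_gb - os_reserve_gb) / num_executors   # -> 40 GB
# Spark adds roughly 10% on top of the heap (spark.executor.memoryOverhead),
# so request about raw / 1.1 for spark.executor.memory itself.
executor_memory_gb = int(raw_mem_per_executor / 1.1)                    # -> ~36 GB

print(num_executors, executor_memory_gb)
```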

Keshinko answered Oct 14 '22