I'm relatively new to PySpark. I have been trying to cache a 30GB data because I need to perform clustering on it. So performing any action like count 
 initially I was getting some heap space issue. So I googled and found that increasing the executor/driver memory will do it for me. So, here's my current configuration
SparkConf().set('spark.executor.memory', '45G')
.set('spark.driver.memory', '80G')
.set('spark.driver.maxResultSize', '10G')
But now I'm getting this garbage collection issue. I checked SO but everywhere the answers are quite vague. People are suggesting to play with the configuration. Is there any better way to figure what the configuration should be? I know that this is just a debug exception and I can turn it off. But still I want to learn a bit of maths for calculating the configurations on my own.
I'm currently on a server with 256GB RAM. Any help is appreciated. Thanks in advance.
To enlarge the Spark shuffle service memory size, modify SPARK_DAEMON_MEMORY in $SPARK_HOME/conf/spark-env.sh, the default value is 2g, and then restart shuffle to make the change take effect.
The GC Overhead Limit Exceeded error is an indication of a resource exhaustion i.e. memory. The JVM throws this error if the Java process spends more than 98% of its time doing GC and only less than 2% of the heap is recovered in each execution.
spark. driver. maxResultSize. Sets a limit on the total size of serialized results of all partitions for each Spark action (such as collect ). Jobs will fail if the size of the results exceeds this limit; however, a high limit can cause out-of-memory errors in the driver.
How many cores does your server/cluster have?
What this GC error is saying is that spark has spent at least 98% of the run time garbage collecting (cleaning up unused objects from memory) but has managed to free <2% of memory while doing so. I don't think its avoidable, as you suggest, because it indicates that memory is almost full and garbage collection is needed. Suppressing this message would likely just lead to an out of memory error shortly afterwards. This link will give you the details about what this error means. Solving it can be as simple as messing around with config settings, as you have mentioned, but it can also mean you need code fixes. Reducing how many temporary objects are being stored, making your dataframe as compact as it could be (encoding strings as indices, for example), and performing joins or other operations at the right time (most memory efficient) can all help. Look into broadcasting smaller dataframes for joins. Its tough to suggest anything without seeing code., as will this resource.
For your spark config tuning, this link should provide all the info you need. Your config settings seem very high at first glance, but I don't know your cluster setup.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With