According to the Spark documentation:
spark.storage.memoryFraction: Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old" generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase it if you configure your own old generation size.
I found several blogs and articles where it is suggested to set it to zero in YARN mode. Why is that better than setting it to something close to 1? And in general, what is a reasonable value for it?
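For concreteness, here is a minimal sketch of how this setting (and the old-generation size it refers to) is typically passed. The fraction shown is just the documented default, and -XX:NewRatio=3 is only one illustrative way to enlarge the old generation, not a recommendation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only.
val conf = new SparkConf()
  .setAppName("storage-fraction-example")
  // Pre-1.6 setting: fraction of the executor heap reserved for Spark's block cache.
  .set("spark.storage.memoryFraction", "0.6")
  // One way to enlarge the JVM old generation the docs mention:
  // -XX:NewRatio=3 makes the old generation roughly 75% of the heap.
  .set("spark.executor.extraJavaOptions", "-XX:NewRatio=3")

val sc = new SparkContext(conf)
```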
By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.
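This presumably refers to the optional minPartitions argument of sc.textFile. A small sketch, assuming an existing SparkContext as in spark-shell; the path and partition counts are made up:

```scala
// Roughly one partition per 128MB HDFS block.
val byBlock = sc.textFile("hdfs:///data/events.log")
// Ask for at least 200 partitions instead.
val finer   = sc.textFile("hdfs:///data/events.log", 200)

println(byBlock.getNumPartitions)
println(finer.getNumPartitions)
```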
To enlarge the Spark shuffle service memory size, modify SPARK_DAEMON_MEMORY in $SPARK_HOME/conf/spark-env.sh (the default value is 2g), and then restart the shuffle service for the change to take effect.
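As a sketch, the relevant line in that file would look like this; 4g is an arbitrary example value, not a recommendation:

```
# $SPARK_HOME/conf/spark-env.sh
# Memory for Spark daemons, including the external shuffle service.
SPARK_DAEMON_MEMORY=4g
```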
According to the recommendations which we discussed above: number of available executors = (total cores / num-cores-per-executor) = 150/5 = 30. Leaving 1 executor for the YARN ApplicationMaster gives --num-executors = 29. Executors per node = 30/10 = 3. Memory per executor = 64GB/3 ≈ 21GB.
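Translated into configuration, those numbers would be passed roughly as below (a sketch mirroring the arithmetic above, before accounting for any memory overhead); the same values can be given to spark-submit as --num-executors, --executor-cores and --executor-memory:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "29") // 30 possible executors minus 1 for the ApplicationMaster
  .set("spark.executor.cores", "5")      // cores per executor used in the calculation
  .set("spark.executor.memory", "21g")   // 64GB per node / 3 executors per node ≈ 21GB
```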
Decrease the InitiatingHeapOccupancyPercent value (the default is 45) so that G1 GC starts its initial concurrent marking earlier, which gives a better chance of avoiding full GCs. Increase the ConcGCThreads value to give concurrent marking more threads and so speed up the concurrent marking phase.
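One way to apply both suggestions is through spark.executor.extraJavaOptions; the specific values below are illustrative starting points, not tuned recommendations:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC " +
    "-XX:InitiatingHeapOccupancyPercent=35 " + // below the default of 45: concurrent marking starts earlier
    "-XX:ConcGCThreads=8")                     // more threads for the concurrent marking phase
```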
The Spark executor's memory is divided into 3 regions.
In Spark 1.5.2 and earlier:
spark.storage.memoryFraction sets the ratio of memory reserved for the first two of those regions. The default value is .6, so 60% of the allocated executor memory is reserved for caching. In my experience, I've only ever seen this number reduced, never increased. Typically, when a developer is fighting a GC issue, the application has a lot of "churn" in objects, and one of the first places to optimize is changing the memoryFraction.
If your application does not cache any data, then setting it to 0 is something you should do. I'm not sure why that would be specific to YARN; can you post the articles?
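For example, a pre-1.6 job that never calls cache() or persist() might simply zero the storage fraction; a sketch, with a made-up application name:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("no-cache-etl")
  // Nothing is cached, so hand the storage region back to execution and user objects.
  .set("spark.storage.memoryFraction", "0")

val sc = new SparkContext(conf)
```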
In Spark 1.6.0 and later:
Memory management is now unified: storage and execution share the same pool of heap memory, so this setting doesn't really apply anymore.
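If you do need to tune the unified model in 1.6+, the replacement knobs are spark.memory.fraction and spark.memory.storageFraction. A sketch with illustrative values; check the defaults for your Spark version rather than taking these numbers as given:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Share of usable heap that storage and execution split between them.
  .set("spark.memory.fraction", "0.6")
  // Portion of that shared pool within which cached blocks are protected from eviction.
  .set("spark.memory.storageFraction", "0.5")
```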