spark.storage.memoryFraction setting in Apache Spark

According to the Spark documentation:

spark.storage.memoryFraction: Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old" generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase it if you configure your own old generation size.

I found several blogs and articles where it is suggested to set it to zero in YARN mode. Why is that better than setting it to something close to 1? And in general, what is a reasonable value for it?

Bob asked Dec 29 '15


People also ask

Which is the default storage level in Spark?

For RDDs, the default storage level used by cache()/persist() is MEMORY_ONLY. Separately, Spark by default creates one partition for each block of the file (blocks being 128 MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.
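A minimal Scala sketch (the HDFS path and the partition count are just placeholders) showing both points: cache() falls back to the default MEMORY_ONLY level for RDDs, and textFile() accepts a minimum number of partitions if you want more than one per block:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partitions-example"))

// Default: one partition per HDFS block (128 MB blocks by default).
// The second argument asks Spark for at least this many partitions.
val lines = sc.textFile("hdfs:///data/input.txt", 100)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY),
// the default storage level for RDDs.
lines.cache()

println(lines.getNumPartitions)
```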

How do I set Spark memory?

To enlarge the Spark shuffle service memory, modify SPARK_DAEMON_MEMORY in $SPARK_HOME/conf/spark-env.sh (the default value is 2g), then restart the shuffle service for the change to take effect.
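That setting sizes the shuffle service daemon itself; the memory of your own application's executors is normally set on the SparkConf (or via spark-submit flags). A minimal Scala sketch with purely illustrative values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative value only. Driver memory is usually passed to spark-submit
// (--driver-memory) instead, because the driver JVM is already running by
// the time application code executes.
val conf = new SparkConf()
  .setAppName("memory-example")
  .set("spark.executor.memory", "8g") // heap size of each executor JVM

val sc = new SparkContext(conf)
```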

How do I allocate executors memory in Spark?

A common rule-of-thumb calculation (assuming 10 nodes, 150 usable cores in total, 5 cores per executor, and 64 GB of memory per node):

  1. Number of available executors = total cores / cores per executor = 150 / 5 = 30
  2. Leaving 1 executor for the YARN ApplicationMaster => --num-executors = 29
  3. Number of executors per node = 30 / 10 = 3
  4. Memory per executor = 64 GB / 3 ≈ 21 GB
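Applied through SparkConf, those figures would translate into something like the following sketch (property names are the standard ones for YARN; the numbers come straight from the calculation above):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Figures from the rule-of-thumb calculation above.
val conf = new SparkConf()
  .setAppName("executor-sizing-example")
  .set("spark.executor.instances", "29") // 30 minus 1 for the ApplicationMaster
  .set("spark.executor.cores", "5")      // cores per executor
  .set("spark.executor.memory", "21g")   // per-executor heap

val sc = new SparkContext(conf)
```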

How do I reduce the GC time on my Spark?

Decrease the InitiatingHeapOccupancyPercent value (the default is 45) so that G1 GC starts the initial concurrent marking earlier, giving a better chance of avoiding full GCs. Increase the ConcGCThreads value to give concurrent marking more threads and so speed up the concurrent marking phase.
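One way to pass those G1 flags to the executors is through spark.executor.extraJavaOptions; a hedged sketch, where the specific numbers are examples to tune against your own GC logs rather than recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Example G1 settings; tune the values against your own GC logs.
val conf = new SparkConf()
  .setAppName("gc-tuning-example")
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=8")

val sc = new SparkContext(conf)
```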


1 Answer

The Spark executor's memory is divided into three regions:

  1. Storage - Memory reserved for caching
  2. Execution - Memory reserved for object creation
  3. Executor overhead.

In Spark 1.5.2 and earlier:

spark.storage.memoryFraction sets how the executor memory is split between regions 1 and 2. The default value is 0.6, so 60% of the allocated executor memory is reserved for caching. In my experience, I've only ever seen this number reduced: typically, when a developer runs into GC issues, the application has a high "churn" of objects, and one of the first places to optimize is to lower the memoryFraction.

If your application does not cache any data, then setting it to 0 is a sensible thing to do. I'm not sure why that would be specific to YARN, though; can you post the articles?
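On Spark 1.5.x and earlier, that would look something like this minimal sketch (setting the fraction to 0 assumes the job never calls cache()/persist(), as described above):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Spark <= 1.5.x only: hand the whole cache region over to execution,
// because this job never caches anything.
val conf = new SparkConf()
  .setAppName("no-cache-job")
  .set("spark.storage.memoryFraction", "0")

val sc = new SparkContext(conf)
```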

In Spark 1.6.0 and later:

Memory management is now unified: storage and execution share a single region of the heap (sized by spark.memory.fraction), so spark.storage.memoryFraction no longer really applies.
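For 1.6.0 and later, the unified region is sized by spark.memory.fraction, and spark.memory.storageFraction marks the portion of it in which cached blocks are protected from eviction. A minimal sketch with illustrative values (the defaults changed between Spark releases):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Spark 1.6.0+: storage and execution share one unified memory region.
val conf = new SparkConf()
  .setAppName("unified-memory-example")
  .set("spark.memory.fraction", "0.6")        // share of usable heap given to the unified region
  .set("spark.memory.storageFraction", "0.5") // portion of that region protected from eviction

val sc = new SparkContext(conf)
```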

Joe Widen answered Sep 28 '22