OutOfMemoryError: Java heap space. An easy way to address an OutOfMemoryError in Java is to increase the maximum heap size with the JVM option "-Xmx512M"; this often makes the OutOfMemoryError go away immediately.
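For example, for a standalone Java program the flag is passed straight to the JVM at launch (the jar name below is just a placeholder, not from the answer above):
# Illustration only: raise the maximum heap size to 512 MB
java -Xmx512M -jar my-app.jar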
You can also resolve it by increasing the number of shuffle partitions: raise the value of spark.sql.shuffle.partitions.
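For instance, the setting might be raised at submit time; the value 400 below is an arbitrary illustration (the default is 200), and the class and jar names are placeholders:
# Sketch: bump the number of shuffle partitions used by wide operations (joins, aggregations)
spark-submit --conf spark.sql.shuffle.partitions=400 --class "MyClass" my-app.jar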
The Spark runtime segregates the JVM heap space in the driver and executors into four different parts:
Storage Memory: JVM heap space reserved for cached data.
Execution Memory: JVM heap space used by data structures during shuffle operations (joins, group-bys and aggregations).
User Memory: JVM heap space for user-defined data structures and internal metadata.
Reserved Memory: a small fixed amount of heap set aside for Spark's own internal objects.
When an object cannot be allocated within these limits, Spark throws a java.lang.OutOfMemoryError exception. Usually, this error is thrown when there is insufficient space to allocate an object in the Java heap: the garbage collector cannot make space available to accommodate a new object, and the heap cannot be expanded further.
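The boundary between these regions is tunable. A sketch of the relevant knobs follows; the values shown are the usual defaults, not recommendations from this answer, and the class and jar names are placeholders:
# spark.memory.fraction: share of (heap - 300MB reserved) given to execution + storage
# spark.memory.storageFraction: portion of that share protected from eviction for storage
spark-submit \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  --class "MyClass" my-app.jar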
I have a few suggestions:
- Increase the executor memory if your nodes have room for it, e.g. spark.executor.memory=6g. Make sure you're using as much memory as possible by checking the UI (it will say how much memory you're using).
- Decrease the fraction of memory reserved for caching, spark.storage.memoryFraction. If you don't use cache() or persist in your code, this might as well be 0. Its default is 0.6, which means you only get 0.4 * 4g of memory for your heap. IME reducing the memory fraction often makes OOMs go away. UPDATE: from Spark 1.6 we apparently will no longer need to play with these values; Spark will determine them automatically.
- Avoid String and heavily nested structures (like Map and nested case classes). If possible, try to only use primitive types and index all non-primitives, especially if you expect a lot of duplicates. Choose WrappedArray over nested structures whenever possible. Or even roll out your own serialisation - YOU will have the most information regarding how to efficiently pack your data into bytes, USE IT!
- When caching, consider using a Dataset to cache your structure, as it will use more efficient serialisation. This should be regarded as a hack when compared to the previous bullet point. Building your domain knowledge into your algo/serialisation can minimise memory/cache-space by 100x or 1000x, whereas all a Dataset will likely give is 2x - 5x in memory and 10x compressed (parquet) on disk.
More information on these properties: http://spark.apache.org/docs/1.2.1/configuration.html
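As a rough illustration (not from the original answer), the first two suggestions might be applied at submit time like this; the class name and jar are placeholders for your own application:
# Sketch only: raise executor memory and shrink the cache fraction
spark-submit \
  --executor-memory 6g \
  --conf spark.storage.memoryFraction=0.1 \
  --class "MyClass" \
  my-app.jar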
EDIT: (So I can google myself easier) The following is also indicative of this problem:
java.lang.OutOfMemoryError: GC overhead limit exceeded
To add a use case to this that is often not discussed, I will pose a solution for when submitting a Spark application via spark-submit in local mode.
According to the gitbook Mastering Apache Spark by Jacek Laskowski:
You can run Spark in local mode. In this non-distributed single-JVM deployment mode, Spark spawns all the execution components - driver, executor, backend, and master - in the same JVM. This is the only mode where a driver is used for execution.
Thus, if you are experiencing OOM errors with the heap, it suffices to adjust the driver-memory rather than the executor-memory.
Here is an example:
spark-1.6.1/bin/spark-submit \
  --class "MyClass" \
  --driver-memory 12g \
  --master local[*] \
  target/scala-2.10/simple-project_2.10-1.0.jar
You should configure offHeap memory settings as shown below:
val spark = SparkSession
.builder()
.master("local[*]")
.config("spark.executor.memory", "70g")
.config("spark.driver.memory", "50g")
.config("spark.memory.offHeap.enabled",true)
.config("spark.memory.offHeap.size","16g")
.appName("sampleCodeForReference")
.getOrCreate()
Give the driver memory and executor memory as per your machine's available RAM. You can increase the offHeap size if you are still facing the OutOfMemory issue.
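If you prefer not to hard-code these, the same settings might equally be passed via spark-submit. This is only an equivalent sketch, with the values copied from the snippet above and the class and jar names as placeholders:
spark-submit \
  --driver-memory 50g \
  --executor-memory 70g \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=16g \
  --class "MyClass" my-app.jar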
You should increase the driver memory. In your $SPARK_HOME/conf folder you should find the file spark-defaults.conf; edit it and set spark.driver.memory 4000m, depending on the memory on your master, I think.
This is what fixed the issue for me, and everything runs smoothly.
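For reference, an entry in spark-defaults.conf is just a whitespace-separated key and value, along these lines:
# $SPARK_HOME/conf/spark-defaults.conf
spark.driver.memory    4000m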
Have a look at the start-up scripts: a Java heap size is set there, and it looks like you're not setting this before running the Spark worker.
# Set SPARK_MEM if it isn't already set since we also use it for this process
SPARK_MEM=${SPARK_MEM:-512m}
export SPARK_MEM
# Set JAVA_OPTS to be able to load native libraries and to set heap size
JAVA_OPTS="$OUR_JAVA_OPTS"
JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM"
You can find the documentation for the deploy scripts here.
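As a sketch, overriding that default simply means exporting the variable before launching the worker; the exact start command depends on your Spark version, so it is omitted here:
# SPARK_MEM is picked up by the script above via ${SPARK_MEM:-512m}
export SPARK_MEM=4g
# ...then start the Spark worker as you normally would.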
I suffered from this issue a lot when using dynamic resource allocation. I had thought it would utilize my cluster resources to best fit the application.
But the truth is that dynamic resource allocation doesn't set the driver memory; it keeps it at its default value, which is 1 GB.
I resolved this issue by setting spark.driver.memory to a value that suits my driver's memory (for 32 GB of RAM I set it to 18g). You can set it using the spark-submit command as follows:
spark-submit --conf spark.driver.memory=18g
Very important note: this property will not be taken into consideration if you set it from code, according to the Spark documentation on Dynamically Loading Spark Properties:
Spark properties mainly can be divided into two kinds: one is related to deploy, like “spark.driver.memory”, “spark.executor.instances”, this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark runtime control, like “spark.task.maxFailures”, this kind of properties can be set in either way.