 

Spark 2.0 memory fraction

I am working with Spark 2.0, the job starts by sorting the input data and storing its output on HDFS.

I was getting out of memory errors, the solution was to increase the value of "spark.shuffle.memoryFraction" from 0.2 to 0.8 and this solved the problem. But in the documentation I have found that this is a deprecated parameter.

As I understand, it was replaced by "spark.memory.fraction". How to modify this parameter while taking into account the sort and storage on HDFS?
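For context, here is a minimal sketch of the kind of setup the question describes: a Spark 2.0 job that sorts text input and writes the result to HDFS, with the legacy shuffle fraction raised from 0.2 to 0.8. The application name and paths are placeholders, not taken from the original job.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder app name and HDFS paths; the legacy setting below is the one
// the question raises from 0.2 to 0.8 (deprecated in Spark 2.0).
val spark = SparkSession.builder()
  .appName("sort-job")
  .config("spark.shuffle.memoryFraction", "0.8")
  .getOrCreate()

val input = spark.read.textFile("hdfs:///path/to/input")   // Dataset[String]
input.sort("value")                                        // sort by the single "value" column
  .write.text("hdfs:///path/to/output")                    // store the sorted output on HDFS
```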

asked Sep 23 '16 by syl

People also ask

What is memory fraction?

spark.memory.fraction – the fraction of (JVM heap space - 300 MB) used for the execution and storage regions (default 0.6).

How do I fix out of memory error in Spark?

You can often resolve it by reducing the amount of data per partition: increase the value of spark.sql.shuffle.partitions.
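A hedged sketch of how that setting could be raised (the value 400 is an arbitrary illustration, not a tuned recommendation):

```scala
import org.apache.spark.sql.SparkSession

// More shuffle partitions means less data per task; 400 is only an example.
val spark = SparkSession.builder()
  .appName("shuffle-partitions-example")
  .config("spark.sql.shuffle.partitions", "400") // default is 200
  .getOrCreate()

// It can also be changed at runtime for subsequent shuffles:
spark.conf.set("spark.sql.shuffle.partitions", "400")
```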

Why is KRYO faster?

Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but it does not support all Serializable types and requires you to register the classes you'll use in the program in advance for best performance. It is not used by default because not every java.io.Serializable type is supported by Kryo out of the box.
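A small sketch of enabling Kryo and registering classes up front; MyRecord and MyKey are hypothetical application classes used only for illustration:

```scala
import org.apache.spark.SparkConf

// Hypothetical application classes, defined only to make the example self-contained.
case class MyRecord(id: Long, value: String)
case class MyKey(id: Long)

// Enable Kryo and register the classes that will be serialized.
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))
```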


1 Answer

From the documentation:

Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:

  • spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%)
    is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually
    large records.
  • spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.

The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. Otherwise, when much of this space is used for caching and execution, the tenured generation will be full, which causes the JVM to significantly increase time spent in garbage collection.
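To make that concrete, a hedged sketch of how the unified memory settings could be adjusted for a Spark 2.0 job; the values are illustrative, not recommendations, and the same keys can equally be passed to spark-submit via --conf:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: give execution/storage a larger share of
// (heap - 300 MB) and reserve less of that share for cached blocks.
val spark = SparkSession.builder()
  .appName("memory-fraction-example")
  .config("spark.memory.fraction", "0.8")        // default 0.6
  .config("spark.memory.storageFraction", "0.3") // default 0.5
  .getOrCreate()
```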

In Spark 1.6.2 I would modify spark.storage.memoryFraction instead.


As a side note, are you sure that you understand how your job behaves?

It's typical to fine-tune a job by starting with the memoryOverhead, the number of cores, etc., and only then moving on to the attribute you modified.
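As a hedged illustration of that tuning order (all values are arbitrary examples, and the memoryOverhead key assumes a YARN deployment, which the original question does not state):

```scala
import org.apache.spark.sql.SparkSession

// Resource-level knobs to tune first; values are examples only.
// spark.yarn.executor.memoryOverhead applies to YARN deployments
// (later Spark versions rename it to spark.executor.memoryOverhead).
val spark = SparkSession.builder()
  .appName("resource-tuning-example")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", "4")
  .config("spark.yarn.executor.memoryOverhead", "1024") // in MB
  .getOrCreate()
```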

answered Oct 05 '22 by gsamaras