I am working with Spark 2.0; the job starts by sorting the input data and storing its output on HDFS.
I was getting out-of-memory errors. The solution was to increase the value of "spark.shuffle.memoryFraction" from 0.2 to 0.8, and this solved the problem. But in the documentation I found that this is a deprecated parameter.
As I understand it, it was replaced by "spark.memory.fraction". How should I modify this parameter while taking into account the sort and the storage on HDFS?
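For reference, a sketch (in Scala) of how these keys can be supplied when the session is built; the application name is a placeholder, and either key can equally be passed with --conf on spark-submit:

    import org.apache.spark.sql.SparkSession

    // Sketch of how the settings are passed at session construction; the app
    // name is a placeholder, and 0.8 / 0.6 are the values discussed above.
    val spark = SparkSession.builder()
      .appName("sort-job")
      .config("spark.shuffle.memoryFraction", "0.8") // deprecated key
      .config("spark.memory.fraction", "0.6")        // its replacement since Spark 1.6
      .getOrCreate()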
spark.memory.fraction – the fraction of the heap space (minus the 300 MB reserved for Spark itself; the heap must be at least 1.5 × that reservation) used for the execution and storage regions (default 0.6)
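As a back-of-the-envelope worked example (the 4 GB executor heap below is an assumed value, not something from the question):

    // Rough sizing of the unified (execution + storage) region.
    // The 4 GiB executor heap is an assumed example value.
    val heapMb         = 4L * 1024          // --executor-memory 4g
    val reservedMb     = 300L               // fixed reservation inside Spark
    val memoryFraction = 0.6                // spark.memory.fraction (default)

    val unifiedRegionMb = ((heapMb - reservedMb) * memoryFraction).toLong
    // (4096 - 300) * 0.6 = 2277.6, so roughly 2277 MB is shared by execution
    // and storage; the remaining ~40% is left for user code and Spark metadata.
    println(s"unified region ≈ $unifiedRegionMb MB")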
You can also resolve it by reducing the partition size: increase the value of spark.sql.shuffle.partitions so that each shuffle partition holds less data.
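A minimal sketch of raising the shuffle partition count; the value 400 is illustrative (the default is 200):

    import org.apache.spark.sql.SparkSession

    // More (and therefore smaller) shuffle partitions mean less data per task
    // during the sort. 400 is an illustrative value; the default is 200.
    val spark = SparkSession.builder().appName("sort-job").getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", "400")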
Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but it does not support all Serializable types and requires you to register the classes you will use in the program in advance for best performance. That is why it is not used by default: not every class that implements java.io.Serializable works with Kryo out of the box, and registration is an extra step.
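If serialization overhead is part of the memory pressure, a sketch of enabling Kryo and registering the shuffled classes (the record types here are placeholders for whatever your job actually moves through the shuffle):

    import org.apache.spark.SparkConf

    // Placeholder record types standing in for whatever the job shuffles.
    case class SortKey(id: Long)
    case class SortRecord(key: SortKey, payload: String)

    // Enable Kryo and register the shuffled classes so Kryo does not have to
    // write the full class name with every record.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[SortKey], classOf[SortRecord]))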
From the documentation:
Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:
spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.

spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.

The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM's old or "tenured" generation. Otherwise, when much of this space is used for caching and execution, the tenured generation will be full, which causes the JVM to significantly increase time spent in garbage collection.
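If you do decide to change them for this sort-heavy job, a minimal sketch of where they have to be set; the 0.8 and 0.3 values are only illustrative, not recommendations:

    import org.apache.spark.sql.SparkSession

    // These are static settings, so they must be in place before the
    // SparkContext starts (via the builder here; --conf on spark-submit
    // works the same way). 0.8 / 0.3 are illustrative values.
    val spark = SparkSession.builder()
      .appName("sort-job")
      .config("spark.memory.fraction", "0.8")        // M: execution + storage share of (heap - 300MB)
      .config("spark.memory.storageFraction", "0.3") // R: part of M protected for cached blocks
      .getOrCreate()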
In spark-1.6.2 I would modify spark.storage.memoryFraction.
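For completeness, a sketch of setting that legacy knob; note that on 1.6.x it is only honoured when spark.memory.useLegacyMode is enabled, and the values below are illustrative:

    import org.apache.spark.SparkConf

    // Legacy (pre-unified) memory knobs; on Spark 1.6.x they only take effect
    // when spark.memory.useLegacyMode is switched on. Values are illustrative.
    val conf = new SparkConf()
      .set("spark.memory.useLegacyMode", "true")
      .set("spark.storage.memoryFraction", "0.4")
      .set("spark.shuffle.memoryFraction", "0.4")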
As a side note, are you sure that you understand how your job behaves? It is typical to fine-tune a job starting from memoryOverhead, the number of cores, etc., and only then move on to the attribute you modified.
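A sketch of the kind of resource-level settings meant here; the concrete values and the YARN-specific overhead key are assumptions about your deployment:

    import org.apache.spark.sql.SparkSession

    // Resource-level knobs usually tuned before the memory fractions.
    // Values are illustrative; the overhead key assumes YARN (Spark 2.0-era name).
    val spark = SparkSession.builder()
      .appName("sort-job")
      .config("spark.executor.memory", "6g")
      .config("spark.executor.cores", "4")
      .config("spark.yarn.executor.memoryOverhead", "1024") // MB of off-heap overhead per executor
      .getOrCreate()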