 

Spark - StorageLevel (DISK_ONLY vs MEMORY_AND_DISK) and Out of memory Java heap space

Lately I've been running a memory-heavy Spark job and started to wonder about Spark's storage levels. Since one of my RDDs was used twice, I persisted it with StorageLevel.MEMORY_AND_DISK. During the job I was getting OOM (Java heap space) errors. Then, when I removed the persist completely, the job managed to run through and finish.

I always thought that MEMORY_AND_DISK was basically a fully safe option: if you run out of memory, the objects are spilled to disk, done. But now it seems it does not really work the way I expected.
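
For context, here is a stripped-down sketch of the kind of job I mean (spark-shell style; the path, the parsing and the filter are made-up placeholders):

    import org.apache.spark.storage.StorageLevel

    // sc is the SparkContext provided by spark-shell.
    // Hypothetical input; the real dataset is much larger.
    val records = sc.textFile("hdfs:///data/events/*.log").map(_.split('\t'))

    // Persisted because it is used twice below.
    records.persist(StorageLevel.MEMORY_AND_DISK)

    val total = records.count()                       // first use
    val wide  = records.filter(_.length > 10).count() // second use

    records.unpersist()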

This raises two questions:

  1. If MEMORY_AND_DISK spills objects to disk when an executor runs out of memory, does it ever make sense to use DISK_ONLY mode (except in some very specific configurations like spark.memory.storageFraction=0)?
  2. If MEMORY_AND_DISK spills objects to disk when an executor runs out of memory, how could removing the caching fix the OOM problem? Did I miss something, and was the problem actually elsewhere?
asked Sep 27 '17 by Matek


People also ask

How you have overcome the Java memory heap issues in spark?

You can resolve it by making the shuffle partitions smaller: increase the value of spark.sql.shuffle.partitions.
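
For example, assuming Spark SQL and a SparkSession named spark (as in spark-shell), the setting can be raised for the session or at submit time; the value 400 below is only an illustration:

    // Raise the number of shuffle partitions so each partition is smaller
    // (400 is an illustrative value, not a recommendation).
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    // Or at submit time:
    //   spark-submit --conf spark.sql.shuffle.partitions=400 ...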

When you compare MEMORY_ONLY and Memory_only_ser which one is more efficient in terms of CPU utilization?

MEMORY_ONLY_SER is the same as MEMORY_ONLY, except that it stores the RDD as serialized objects in JVM memory. It takes less memory (it is more space-efficient) than MEMORY_ONLY because objects are kept serialized, but it costs a few extra CPU cycles to deserialize them on access.
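
As a sketch in the Scala RDD API (rddA and rddB are placeholder RDDs; a given RDD can only be persisted with one level):

    import org.apache.spark.storage.StorageLevel

    // Deserialized Java objects: fastest to access, but largest on the heap.
    rddA.persist(StorageLevel.MEMORY_ONLY)

    // Serialized byte arrays: smaller in memory, but each access pays
    // extra CPU cycles to deserialize.
    rddB.persist(StorageLevel.MEMORY_ONLY_SER)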

How do I check the spark level of my storage?

You can check the storage level of an RDD using the getStorageLevel operation.
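
For example, in the Scala RDD API (sc being the SparkContext, e.g. from spark-shell):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 100)

    // Before persisting, the level is StorageLevel.NONE.
    println(rdd.getStorageLevel == StorageLevel.NONE)  // true

    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    println(rdd.getStorageLevel)  // prints the flags of MEMORY_AND_DISK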


1 Answer

MEMORY_AND_DISK doesn't "spill the objects to disk when the executor runs out of memory". It tells Spark to write partitions that do not fit in memory to disk, so they can be loaded from there when needed.

When dealing with huge datasets, you should definitely consider persisting the data with DISK_ONLY. https://spark.apache.org/docs/latest/rdd-programming-guide.html#which-storage-level-to-choose
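
If the dataset really doesn't fit in memory, switching the persist call is a one-line change. A minimal sketch, assuming the same records RDD as in the question:

    import org.apache.spark.storage.StorageLevel

    // Partitions are written to disk only, so cached data never competes
    // with execution memory; reads are slower, but the heap stays small.
    records.persist(StorageLevel.DISK_ONLY)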

answered Oct 19 '22 by baitmbarek