Lately I've been running a memory-heavy Spark job and started to wonder about Spark's storage levels. I persisted one of my RDDs, since it was used twice, with StorageLevel.MEMORY_AND_DISK. I was getting an OOM (Java heap space) error during the job. Then, when I removed the persist completely, the job managed to run through and finish.
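A simplified sketch of the pattern (sc is the SparkContext; the input path and transformations are placeholders, the real job is far heavier):

    import org.apache.spark.storage.StorageLevel

    // Placeholder input; the real dataset is much larger.
    val lines = sc.textFile("hdfs:///data/input")

    // Persisted because the RDD is used twice below.
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    val totalLines = lines.count()
    val errorLines = lines.filter(_.contains("ERROR")).count()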
I always thought that MEMORY_AND_DISK is basically a fully safe option: if you run out of memory, the objects are spilled to disk, done. But now it seems that it did not really work the way I expected it to.
This raises two questions:

1. If MEMORY_AND_DISK spills the objects to disk when the executor runs out of memory, does it ever make sense to use DISK_ONLY mode (except for some very specific configurations like spark.memory.storageFraction=0)?
2. If MEMORY_AND_DISK spills the objects to disk when the executor runs out of memory, how could I fix the OOM problem by removing the caching? Did I miss something, and was the problem actually elsewhere?

You can resolve the OOM by adjusting the partitioning: increase the value of spark.sql.shuffle.partitions so that each partition, and therefore each task, is smaller.
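For example, assuming a SparkSession named spark (400 is only an illustration; the default is 200, and the right value depends on your data volume):

    // More shuffle partitions means smaller partitions, which reduces
    // the amount of memory each task needs at once.
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    // Or pass it at submit time:
    // spark-submit --conf spark.sql.shuffle.partitions=400 ...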
MEMORY_ONLY_SER is the same as MEMORY_ONLY, except that it stores the RDD as serialized objects in JVM memory. It takes less memory (it is more space-efficient) than MEMORY_ONLY because the objects stay serialized, at the cost of a few extra CPU cycles to deserialize them when they are read.
You can check the current storage level using the getStorageLevel() operation.
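For example (Scala API, with sc as the SparkContext and a placeholder path):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input")   // placeholder path
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)     // serialized: less memory, more CPU

    // Inspect which storage level is currently set for this RDD.
    println(rdd.getStorageLevel)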
MEMORY_AND_DISK doesn't "spill the objects to disk when the executor goes out of memory".
It tells Spark to write partitions that do not fit in memory to disk, so they can be loaded from there when needed.
When dealing with huge datasets, you should definitely consider persisting data with DISK_ONLY. https://spark.apache.org/docs/latest/rdd-programming-guide.html#which-storage-level-to-choose
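A minimal sketch (placeholder path, sc as the SparkContext):

    import org.apache.spark.storage.StorageLevel

    // Cache straight to disk: no cached partitions are kept in the heap,
    // which avoids the memory pressure of an in-memory storage level.
    val huge = sc.textFile("hdfs:///data/huge")
    huge.persist(StorageLevel.DISK_ONLY)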