Lately I've been running a memory-heavy Spark job and started to wonder about Spark's storage levels. I persisted one of my RDDs, since it was used twice, with StorageLevel.MEMORY_AND_DISK. I was getting an OOM (Java heap space) error during the job. Then, when I removed the persist completely, the job managed to run through and finish.
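A simplified sketch of the pattern (sc is the SparkContext; the input path and transformations are placeholders, the real job is far heavier):

    import org.apache.spark.storage.StorageLevel

    // Placeholder input; the real dataset is much larger.
    val lines = sc.textFile("hdfs:///data/input")

    // Persisted because the RDD is used twice below.
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    val totalLines = lines.count()
    val errorLines = lines.filter(_.contains("ERROR")).count()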
I always thought that MEMORY_AND_DISK is basically a fully safe option: if you run out of memory, the objects are spilled to disk, done. But now it seems that it did not really work the way I expected it to.
This raises two questions:

1. If MEMORY_AND_DISK spills the objects to disk when the executor runs out of memory, does it ever make sense to use DISK_ONLY mode (except for some very specific configurations like spark.memory.storageFraction=0)?
2. If MEMORY_AND_DISK spills the objects to disk when the executor runs out of memory, how could I fix the OOM problem by removing the caching? Did I miss something, and was the problem actually elsewhere?

You can resolve the OOM by adjusting the partitioning: increase the value of spark.sql.shuffle.partitions so that each partition, and therefore each task, is smaller.
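For example, assuming a SparkSession named spark (400 is only an illustration; the default is 200, and the right value depends on your data volume):

    // More shuffle partitions means smaller partitions, which reduces
    // the amount of memory each task needs at once.
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    // Or pass it at submit time:
    // spark-submit --conf spark.sql.shuffle.partitions=400 ...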
MEMORY_ONLY_SER is the same as MEMORY_ONLY, except that it stores the RDD as serialized objects in JVM memory. It takes less memory (it is more space-efficient) than MEMORY_ONLY because the objects stay serialized, at the cost of a few extra CPU cycles to deserialize them when they are read.
You can check the current storage level using the getStorageLevel() operation.
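For example (Scala API, with sc as the SparkContext and a placeholder path):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input")   // placeholder path
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)     // serialized: less memory, more CPU

    // Inspect which storage level is currently set for this RDD.
    println(rdd.getStorageLevel)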
MEMORY_AND_DISK doesn't "spill the objects to disk when the executor goes out of memory".
It tells Spark to write partitions that do not fit in memory to disk, so they can be loaded from there when needed.
When dealing with huge datasets, you should definitely consider persisting data with DISK_ONLY. https://spark.apache.org/docs/latest/rdd-programming-guide.html#which-storage-level-to-choose
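A minimal sketch (placeholder path, sc as the SparkContext):

    import org.apache.spark.storage.StorageLevel

    // Cache straight to disk: no cached partitions are kept in the heap,
    // which avoids the memory pressure of an in-memory storage level.
    val huge = sc.textFile("hdfs:///data/huge")
    huge.persist(StorageLevel.DISK_ONLY)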