What happens if the data can't fit in memory with cache() in Spark?

Question

I am new to Spark. I have read at multiple places that using cache() on a RDD will cause it to be stored in memory but I haven't so far found clear guidelines or rules of thumb on "How to determine the max size of data" that one could cram into memory? What happens if the amount of data that I am calling "cache" on, exceeds the memory ? Will it cause my job to fail or will it still complete with a noticeable impact on Cluster performance?

Thanks!

zero323 · Accepted Answer

As it is clearly stated in the official documentation with MEMORY_ONLY persistence (equivalent to cache):

If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.

Even if data fits into memory it can be evicted if new data comes in. In practice caching is more a hint than a contract. You cannot depend on caching take place but you don't have to if it succeeds either.

Note:

Please keep in mind that the default StorageLevel for Dataset is MEMORY_AND_DISK, which will:

If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

See also:

(Why) do we need to call cache or persist on a RDD
Why do I have to explicitly tell Spark what to cache?

What happens if the data can't fit in memory with cache() in Spark?

Tags:

distributed-computing

apache-spark

cluster-computing

user2895779

1 Answers

zero323

Recent Activity

Donate For Us

What happens if the data can't fit in memory with cache() in Spark?

Tags:

distributed-computing

apache-spark

cluster-computing

user2895779

1 Answers

zero323

Related questions

Recent Activity

Donate For Us