What happens if the data can't fit in memory with cache() in Spark?

I am new to Spark. I have read in multiple places that calling cache() on an RDD causes it to be stored in memory, but so far I haven't found clear guidelines or rules of thumb for determining the maximum amount of data that can fit in memory. What happens if the data I call cache() on exceeds the available memory? Will it cause my job to fail, or will the job still complete, with a noticeable impact on cluster performance?

Thanks!

asked Feb 29 '16 by user2895779

1 Answer

As is clearly stated in the official documentation, with MEMORY_ONLY persistence (which is what cache() uses):

If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.

Even if the data fits into memory, it can be evicted when new data comes in. In practice, caching is more of a hint than a contract: you cannot depend on it happening, but correctness does not depend on it either.
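
To illustrate, here is a minimal sketch (run in a spark-shell, where sc is the SparkContext; the input path is hypothetical) showing that cache() is only a hint and that an explicit storage level can be requested with persist():

    import org.apache.spark.storage.StorageLevel

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY) on an RDD
    val rdd = sc.textFile("hdfs:///some/large/input").map(_.length)

    rdd.cache()   // hint: keep partitions in memory if they fit
    rdd.count()   // first action materializes the RDD and caches what fits

    // Partitions that did not fit are recomputed here; the job does not fail
    rdd.count()

    // If recomputation is too expensive, ask for spill-to-disk explicitly
    rdd.unpersist()
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()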

Note:

Please keep in mind that the default StorageLevel for a Dataset is MEMORY_AND_DISK, which will:

If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
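
As a quick check (again just a sketch, assuming a spark-shell session where spark is the SparkSession and the size is illustrative), you can confirm the storage level a cached Dataset actually gets:

    // Dataset.cache() uses MEMORY_AND_DISK by default, so partitions that
    // don't fit in memory spill to disk instead of being recomputed
    val ds = spark.range(0, 1000000000L)
    ds.cache()
    ds.count()

    // Inspect the storage level in use
    println(ds.storageLevel)   // e.g. StorageLevel(disk, memory, deserialized, 1 replicas)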

See also:

  • (Why) do we need to call cache or persist on a RDD
  • Why do I have to explicitly tell Spark what to cache?
answered by zero323