Does Apache Spark cache RDD in node-level or cluster-level?

I know that Apache Spark's persist method keeps RDDs in memory and that, if there is not enough memory, it stores the remaining partitions of the RDD on disk (with a storage level such as MEMORY_AND_DISK). What I can't seem to understand is the following:

Imagine we have a cluster and we want to persist an RDD. Suppose node A does not have a lot of free memory, while node B does. Now suppose that, after running the persist command, node A runs out of memory. The question is:

Does Apache Spark look for more memory on node B and try to store everything in memory?

Or, given that there is not enough space on node A, does Spark store the remaining partitions of the RDD on node A's disk, even though there is some memory available on node B?
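
For concreteness, this is roughly the persist call I have in mind (a minimal spark-shell sketch; sc is the SparkContext predefined by the shell, and the data is just a placeholder):

    import org.apache.spark.storage.StorageLevel

    // A placeholder RDD, large enough that not all partitions fit in memory.
    val rdd = sc.parallelize(1 to 10000000, numSlices = 100)

    // MEMORY_AND_DISK: partitions that fit are kept in executor memory,
    // the rest spill to disk. But whose disk, and whose memory?
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count() // persist is lazy; an action materializes the cache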

Thanks for your answers.

asked Nov 20 '25 by YACINE GACI


1 Answer

Normally, Spark doesn't search for free space elsewhere in the cluster. Each partition is cached locally on the executor responsible for it.

The only exception is when you use a replicated persistence mode (one of the storage levels with the _2 suffix); in that case, an additional copy of each partition is placed on another node.
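
For example, a minimal spark-shell sketch (sc is the SparkContext predefined by the shell):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)

    // MEMORY_AND_DISK_2 behaves like MEMORY_AND_DISK but keeps a second
    // replica of each cached partition on a different node, so a copy can
    // land on node B even when the primary copy lives on node A.
    rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
    rdd.count() // trigger the caching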

answered Nov 22 '25 by user10391155


