I have been reading about RDDs and how various transformations are affected by partitions, and how some transformations affect partitions themselves. While I understand this, I am not able to relate it to the bigger picture as to how this fits in a cluster where we have multiple nodes.
Is there a one-to-one correspondence between a partition and a node? That is, is there ideally a single partition per node? And if not, how does Spark decide how many partitions of a specific RDD reside on the same node?
More specifically, I can think of the following possibilities:
1) All the partitions of a given RDD reside on the same node.
2) All the partitions of the same RDD reside on different nodes (but what is the basis of the split?).
3) Partitions of the same RDD are scattered across the cluster, some of them on the same node and some on different nodes (again, what is the basis of this distribution?).
Can someone please explain or at least point me to some specific link which answers exactly this?
Since Spark supports pluggable resource management, the details of the distribution will depend on the cluster manager you use (Standalone, YARN, Mesos).
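One way to see which of the three scenarios applies on your own cluster is to record, for every partition, the host that processed it. The following is only a minimal PySpark sketch under some assumptions: the helper names (`describe_partition`, `partitions_per_host`, `run_on_spark`) are made up for illustration, the RDD is a throwaway one built with `parallelize`, and a Spark installation is assumed to be available when `run_on_spark` is called.

```python
import socket
from collections import Counter


def describe_partition(index, rows):
    """Yield one (partition index, host name, row count) tuple for a partition."""
    count = sum(1 for _ in rows)
    yield (index, socket.gethostname(), count)


def partitions_per_host(descriptions):
    """Count how many partitions were processed on each host."""
    return Counter(host for _, host, _ in descriptions)


def run_on_spark(num_partitions=8):
    # Hypothetical driver function; pyspark is imported here so that the
    # helpers above can be used and tested without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(1000), num_partitions)
    # mapPartitionsWithIndex calls describe_partition once per partition,
    # on whichever executor the scheduler assigned that partition's task to.
    descriptions = rdd.mapPartitionsWithIndex(describe_partition).collect()
    return partitions_per_host(descriptions)
```

Calling `run_on_spark()` returns a mapping from host name to the number of partitions processed there. In local mode every partition reports the same host. On a multi-node cluster you would typically expect several partitions per host, which is closest to scenario 3, but the exact split depends on the cluster manager, the number of executors and cores, and data locality.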