 

Relationship between RDDs, partitions, and nodes

I have been reading about RDDs and how various transformations are affected by partitions, and how some transformations affect partitions themselves. While I understand this, I am not able to relate it to the bigger picture as to how this fits in a cluster where we have multiple nodes.

Is there a one-to-one correspondence between a partition and a node? That is, is there ideally a single partition per node? If not, how does Spark decide how many partitions of a given RDD reside on the same node?

More specifically, I can think of one of the following:

1) All the partitions of a given RDD reside on the same node.
2) All the partitions of the same RDD reside on different nodes (but what is the basis of the split?).
3) The partitions of the same RDD are scattered across the cluster, some of them on the same node and some on different nodes (again, what is the basis of this distribution?).

Can someone please explain, or at least point me to a specific link that answers exactly this?

Dhiraj asked Jul 11 '15 16:07



1 Answer

  • a single RDD has one or more partitions scattered across multiple nodes,
  • a single partition is processed on a single node,
  • a single node can handle multiple partitions (the official documentation suggests 2-4 partitions per CPU as the optimum).
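
To make the three points concrete, here is a toy model in plain Python. It is not Spark code and it does not reproduce Spark's scheduler: the node names, the core count, and the round-robin placement are all assumptions made purely for illustration. It only shows the shape of the relationship, where one dataset is split into several partitions, each partition lands on exactly one node, and each node ends up holding several partitions.

```python
from collections import defaultdict


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal slices."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


def assign_round_robin(num_partitions, nodes):
    """Map each partition index to exactly one node.

    Round-robin is an assumption for this sketch, not Spark's real policy.
    """
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


def suggested_partition_range(total_cores):
    """Apply the 2-4 partitions per CPU rule of thumb quoted above."""
    return 2 * total_cores, 4 * total_cores


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
cores_per_node = 2                      # hypothetical hardware

low, high = suggested_partition_range(cores_per_node * len(nodes))
partitions = split_into_partitions(list(range(100)), low)
placement = assign_round_robin(len(partitions), nodes)

# Invert the mapping to see which partitions each node holds.
by_node = defaultdict(list)
for partition_index, node in placement.items():
    by_node[node].append(partition_index)

for node in nodes:
    print(node, "holds partitions", by_node[node])
```

In a real cluster the placement is decided by Spark together with the resource manager, and it can also take data locality into account, so the actual distribution is usually less regular than this.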

Since Spark supports pluggable resource management, the details of the distribution will depend on the resource manager you use (Standalone, YARN, Mesos).

zero323 answered Oct 11 '22 11:10
