Relationship between RDD, partitions and nodes

Question

I have been reading about RDDs and how various transformations are affected by partitions, and how some transformations affect partitions themselves. While I understand this, I am not able to relate it to the bigger picture as to how this fits in a cluster where we have multiple nodes.

Is there a one-to-one correspondence between a partition and a node? That is, is there ideally a single partition per node? If not, how does Spark decide how many partitions of a given RDD have to reside on the same node?

More specifically, I can think of one of the following:

1) All the partitions of a given RDD reside on the same node.
2) All partitions of the same RDD reside on different nodes (but what is the basis of the split?).
3) Partitions of the same RDD are scattered across the cluster, some of them on the same node and some on different nodes (again, what is the basis of this distribution?).

Can someone please explain or at least point me to some specific link which answers exactly this?

zero323 · Accepted Answer

a single RDD has one or more partitions scattered across multiple nodes,
a single partition is processed on a single node,
a single node can handle multiple partitions (the official documentation suggests 2-4 partitions per CPU as the optimum)

Since Spark supports pluggable resource management, the details of the distribution will depend on the cluster manager you use (Standalone, YARN, Mesos).
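To make the three points above concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's scheduler: the node names, the round-robin placement, and the helper functions `suggested_partitions` and `assign_partitions` are all assumptions made for illustration. Real placement also depends on data locality and on the cluster manager in use.

```python
from collections import defaultdict


def suggested_partitions(total_cores, per_core=3):
    """Rule of thumb from the answer: roughly 2-4 partitions per CPU."""
    if not 2 <= per_core <= 4:
        raise ValueError("per_core is expected to be between 2 and 4")
    return total_cores * per_core


def assign_partitions(num_partitions, node_cores):
    """Toy round-robin placement (an assumption, not Spark's algorithm).

    node_cores maps a hypothetical node name to its number of cores.
    Every partition lands on exactly one node; a node may hold many.
    """
    # One slot per core, so nodes with more cores receive more partitions.
    slots = [
        node
        for node, cores in sorted(node_cores.items())
        for _ in range(cores)
    ]
    placement = defaultdict(list)
    for partition_id in range(num_partitions):
        placement[slots[partition_id % len(slots)]].append(partition_id)
    return dict(placement)


# Hypothetical cluster: two nodes with 2 cores each, one node with 4 cores.
cluster = {"node-a": 2, "node-b": 2, "node-c": 4}
num_partitions = suggested_partitions(sum(cluster.values()))
placement = assign_partitions(num_partitions, cluster)

# Print each node with the number of partitions it holds and their ids.
for node, partitions in sorted(placement.items()):
    print(node, len(partitions), partitions)
```

In this model a single RDD's partitions are spread over several nodes, no partition is split across nodes, and each node holds several partitions. On a real cluster, `rdd.getNumPartitions()` reports how many partitions an RDD actually has.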

Tags:

apache-spark

rdd

Dhiraj
