How does Spark recover the data from a failed node?

Suppose we have an RDD which is used multiple times. To avoid computing it again and again, we persisted it using the rdd.persist() method.

When we persist this RDD, the nodes computing it store their partitions.

Now suppose the node holding a persisted partition of the RDD fails. What happens then? How will Spark recover the lost data? Is there a replication mechanism, or something else?
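For concreteness, here is a minimal sketch of the setup being asked about (the SparkContext sc and the input path are placeholders, not from the original question):

import org.apache.spark.storage.StorageLevel

// Hypothetical example: an RDD that several actions reuse.
val words = sc.textFile("hdfs:///data/input.txt") // placeholder path
  .flatMap(_.split(" "))

words.persist(StorageLevel.MEMORY_ONLY) // cache partitions on the computing nodes

val total    = words.count()            // first action materializes and caches
val distinct = words.distinct().count() // reuses the cached partitions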

asked Jan 30 '23 by KayV

1 Answer

When you call rdd.persist, the RDD doesn't materialize its content right away. It does so when you perform an action on it; it follows the same lazy evaluation principle.
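A quick sketch of that laziness (assuming a SparkContext named sc):

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
rdd.persist() // only marks the RDD for caching; nothing runs yet
rdd.count()   // the action computes the partitions and caches them
rdd.count()   // served from the cache, no recomputation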

Now, an RDD knows the partitions on which it should operate and the lineage DAG associated with it. With that DAG it is perfectly capable of recreating a materialized partition.
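You can inspect that lineage yourself: rdd.toDebugString prints the DAG Spark would replay to rebuild a lost partition (the exact output shape varies by Spark version):

println(rdd.toDebugString)
// (8) MapPartitionsRDD[1] at map at <console>:26 [Memory Deserialized 1x Replicated]
//  |  ParallelCollectionRDD[0] at parallelize at <console>:26 [Memory Deserialized 1x Replicated]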

So, when a node fails, the driver spawns another executor on some other node and provides it, in a closure, with the data partition it was supposed to work on and the DAG associated with it. With this information the executor can recompute the data and materialize it.

In the meantime, the cached RDD won't have all of its data in memory; the data of the lost node has to be recomputed (or, with a disk-backed storage level, re-fetched from disk), so those accesses take a little more time.

As for replication: yes, Spark supports in-memory replication. You need to pass StorageLevel.MEMORY_AND_DISK_2 when you persist.

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

This ensures each partition is stored on two different nodes, so losing one of them does not force a recompute.
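To double-check what was applied, the storage level can be read back from the RDD via getStorageLevel (part of the RDD API; the description string below is illustrative):

println(rdd.getStorageLevel.replication) // 2
println(rdd.getStorageLevel.description) // e.g. "Disk Memory Deserialized 2x Replicated"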

answered Jan 31 '23 by Avishek Bhattacharya