
Where is Spark RDD lineage stored?

Where is the Spark RDD lineage stored? According to the RDD white paper, it is kept in memory, but I want to know whether it lives on the driver or somewhere else in the cluster.

Also, how is fault tolerance ensured, i.e. how many replicas of the RDD (metadata) are created by default?

I want to understand the core framework behaviour when the persist() method is not used.

Bhavuk Chawla asked Jan 11 '16 03:01



1 Answer

The RDD lineage lives on the driver, where the RDDs themselves live. It's an internal part of every RDD, and that's how an RDD knows its parents. Once a job has been submitted, this information is no longer relevant.
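To make the "internal part of every RDD" point concrete, here is a minimal toy sketch in plain Python. It is not Spark code: `ToyRDD` and its methods are hypothetical names invented for illustration. It only assumes the idea described above, that each dataset object holds references to its parents and that all of these objects sit in one process (the driver, in Spark's case).

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; not part of any Spark API."""

    def __init__(self, name, parents=(), compute=None, source=None):
        self.name = name
        self.parents = list(parents)  # the lineage: references to parent objects
        self._compute = compute       # how to derive this data from the parents' data
        self._source = source         # only set for the root dataset

    def map(self, f):
        return ToyRDD("map", [self], lambda ps: [f(x) for x in ps[0]])

    def filter(self, pred):
        return ToyRDD("filter", [self], lambda ps: [x for x in ps[0] if pred(x)])

    def lineage(self):
        """Walk the parent references back to the root."""
        names = [self.name]
        for parent in self.parents:
            names.extend(parent.lineage())
        return names

    def collect(self):
        """Recompute from the root every time, since nothing is persisted."""
        if self._source is not None:
            return list(self._source)
        return self._compute([p.collect() for p in self.parents])


numbers = ToyRDD("parallelize", source=[1, 2, 3, 4, 5])
doubled = numbers.map(lambda x: x * 2)
big = doubled.filter(lambda x: x > 4)

print(big.lineage())  # ['filter', 'map', 'parallelize']
print(big.collect())  # [6, 8, 10]
```

In real Spark you can inspect the same kind of information on the driver with `rdd.toDebugString` (and, in the Scala API, `rdd.dependencies`).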

When the driver fails, the RDD lineage is gone, as is the entire computation. The driver is...well...the driver, and without it nothing really happens.

Jacek Laskowski answered Oct 17 '22 11:10
