 

Will there be any scenario where Spark RDDs fail to satisfy immutability?

Spark RDDs are constructed to be immutable, fault-tolerant, and resilient.

Do RDDs satisfy immutability in all scenarios? Or is there any case, be it in Streaming or Core, where an RDD might fail to satisfy immutability?

asked Sep 06 '15 by Srini

People also ask

Is Spark RDD immutable?

RDDs are immutable, which means you cannot alter the state of an RDD: you cannot add new records, delete records, or update records inside an RDD.
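
For illustration, a minimal sketch (assuming an active SparkContext sc, as in spark-shell): transformations only describe new RDDs, they never modify the one they are called on.

// Transformations return a new RDD; the original is never altered.
val base    = sc.parallelize(Seq(1, 2, 3))
val doubled = base.map(_ * 2)   // new RDD derived from `base`
base.collect()                  // Array(1, 2, 3) -- unchanged
doubled.collect()               // Array(2, 4, 6)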

What happens when worker node fails in Spark?

Failure of a worker node – A Spark worker node is a node that runs application code on the Spark cluster; these are the slave nodes. Any worker node running an executor can fail, resulting in the loss of its in-memory data. If any receivers were running on the failed nodes, their buffered data will be lost.

Why is RDD immutable in nature?

There are a few reasons for keeping RDDs immutable: 1) immutable data can be shared easily; 2) it can be recreated at any point in time; 3) immutable data can live in memory as easily as on disk.

Does Spark support fault tolerance?

Spark is by its nature very fault-tolerant. However, faults and application failures can and do happen in production at scale.


1 Answer

It depends on what you mean when you talk about an RDD. Strictly speaking, an RDD is just a description of lineage which exists only on the driver, and it doesn't provide any methods that can be used to mutate that lineage.
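
To make that concrete, here is a small sketch (assuming an active SparkContext sc, e.g. in spark-shell): an RDD is only a recipe, and toDebugString prints that lineage, while nothing in the API lets you rewrite it.

// An RDD on the driver is just a lineage description.
val lineage = sc.parallelize(1 to 4).map(_ + 1).filter(_ % 2 == 0)
println(lineage.toDebugString)
// prints something like:
// (n) MapPartitionsRDD[2] at filter ...
//  |  MapPartitionsRDD[1] at map ...
//  |  ParallelCollectionRDD[0] at parallelize ...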

When data is processed we can no longer talk about RDDs but about tasks; nevertheless, the data is exposed using immutable data structures (scala.collection.Iterator in Scala, itertools.chain in Python).
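
For example (a sketch, again assuming an active SparkContext sc), mapPartitions hands each task its slice of the data as a plain scala.collection.Iterator, which offers no way to modify the underlying collection:

// Each task sees its partition as an immutable, one-pass Iterator.
sc.parallelize(1 to 6, 2).mapPartitions(iter => Iterator.single(iter.sum)).collect()
// Array(6, 15) -- one sum per partition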

So far so good. Unfortunately, immutability of a data structure doesn't imply immutability of the stored data. Let's create a small example to illustrate that:

// Each element is a mutable Array(0); the map closure increments it in place.
val rdd = sc.parallelize(Array(0) :: Array(0) :: Array(0) :: Nil)
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0

You can execute this as many times as you want and get the same result. Now let's cache rdd and repeat the whole process:

rdd.cache
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 6.0
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 9.0

Since the function we use in the first map is not pure and modifies its mutable argument in place, these changes accumulate with each execution and result in unpredictable output. For example, if rdd is evicted from the cache we can once again get 3.0. If only some partitions are cached, you can get mixed results.
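
A safer pattern, sketched here, is to derive a new value instead of mutating the cached array; with that, repeated runs give the same result whether rdd is cached or not:

// Build a new value instead of mutating the cached Array in place.
rdd.map(a => a.head + 1).sum
// Double = 3.0 on every run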

PySpark provides stronger isolation, and obtaining a result like this is not possible there, but that is a matter of architecture, not of immutability.

The takeaway message here is that you should be extremely careful when working with mutable data and avoid any in-place modifications unless it is explicitly allowed (fold, aggregate).
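
As a minimal sketch of the "explicitly allowed" case (assuming an active SparkContext sc): aggregate's seqOp and combOp may modify and return their first argument, because each partition works on its own fresh copy of the zero value, so nothing mutable is shared between tasks.

import scala.collection.mutable.ArrayBuffer

// Mutating the accumulator in place is fine here: Spark deserializes a
// fresh copy of the zero value for every partition.
val nums = sc.parallelize(1 to 10, 2)
val buffer = nums.aggregate(ArrayBuffer.empty[Int])(
  (buf, x) => { buf += x; buf },   // seqOp: add to the per-partition buffer in place
  (b1, b2) => { b1 ++= b2; b1 }    // combOp: merge buffers in place
)
// buffer: ArrayBuffer containing all ten elements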

answered Sep 18 '22 by zero323