Spark RDDs are constructed in an immutable, fault-tolerant, and resilient manner.
Do RDDs satisfy immutability in all scenarios? Or is there any case, be it in Streaming or Core, where an RDD might fail to satisfy immutability?
RDDs are immutable: you cannot alter the state of an RDD, i.e. you cannot add, delete, or update records inside it.
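To make that concrete, here is a minimal sketch (assuming a running SparkContext sc, as in the examples further down): transformations never modify an existing RDD, they always return a new one.

val original = sc.parallelize(Seq(1, 2, 3))
val incremented = original.map(_ + 1) // a brand new RDD; `original` is untouched
original.collect()    // Array(1, 2, 3)
incremented.collect() // Array(2, 3, 4)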
Failure of a worker node – A worker node is the node that runs the application code on the Spark cluster; these are the slave nodes. Any worker node running an executor can fail, resulting in the loss of its in-memory data. If any receivers were running on the failed node, their buffered data will be lost.
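If that buffered receiver data matters, Spark Streaming can checkpoint state and write received blocks to a write-ahead log before they are processed. A minimal sketch, assuming a receiver-based socket source and a reachable checkpoint directory (both hypothetical here):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-example")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // log received blocks durably

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/wal-example")   // hypothetical checkpoint directory
val lines = ssc.socketTextStream("localhost", 9999) // hypothetical receiver-based source
lines.count().print()

ssc.start()
ssc.awaitTermination()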
There are a few reasons for keeping RDDs immutable: 1. Immutable data can be shared easily. 2. It can be recreated at any point in time. 3. Immutable data can live in memory as easily as on disk.
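Points 2 and 3 can be seen directly in the API; a minimal sketch, assuming a running SparkContext sc: the lineage needed to recreate an RDD is always available, and persisted data can spill from memory to disk.

import org.apache.spark.storage.StorageLevel

val base = sc.parallelize(1 to 10)
val doubled = base.map(_ * 2)
doubled.toDebugString                         // the lineage used to recreate `doubled` at any time
doubled.persist(StorageLevel.MEMORY_AND_DISK) // immutable data can sit in memory and spill to disk
doubled.sum()                                 // Double = 110.0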
Spark is by its nature very fault tolerant. However, faults and application failures can and do happen in production at scale.
It depends on what you mean when you talk about an RDD. Strictly speaking, an RDD is just a description of lineage which exists only on the driver, and it doesn't provide any methods which can be used to mutate that lineage.
When data is processed we can no longer talk about RDDs but about tasks; nevertheless, the data is exposed using immutable data structures (scala.collection.Iterator in Scala, itertools.chain in Python).
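A minimal sketch of that task-side view, assuming a running SparkContext sc: with mapPartitions each task sees its partition only as a scala.collection.Iterator, which can be traversed but gives no way to write back into the RDD.

val nums = sc.parallelize(1 to 6, numSlices = 2)
val perPartitionSums = nums.mapPartitions { iter =>
  // `iter` can only be consumed forward; there is no handle for mutating the source data
  Iterator(iter.sum)
}
perPartitionSums.collect() // Array(6, 15)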
So far so good. Unfortunately, immutability of a data structure doesn't imply immutability of the stored data. Let's create a small example to illustrate that:
val rdd = sc.parallelize(Array(0) :: Array(0) :: Array(0) :: Nil)
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 3.0
You can execute this as many times as you want and get the same result. Now let's cache rdd and repeat the whole process:
rdd.cache
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 3.0
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 6.0
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 9.0
Since the function we use in the first map is not pure and modifies its mutable argument in place, these changes accumulate with each execution and result in unpredictable output. For example, if rdd is evicted from the cache, we can once again get 3.0. If only some partitions are cached, you can get mixed results.
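A pure alternative, sketched here on a fresh copy of the same data, stays deterministic whether or not it is cached:

val safeRdd = sc.parallelize(Array(0) :: Array(0) :: Array(0) :: Nil)
safeRdd.cache
safeRdd.map(a => a.head + 1).sum // Double = 3.0
safeRdd.map(a => a.head + 1).sum // Double = 3.0, no matter how many times you run it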
PySpark provides stronger isolation, so obtaining a result like this is not possible there, but that is a matter of architecture, not immutability.
The takeaway message here is that you should be extremely careful when working with mutable data and avoid any in-place modifications unless it is explicitly allowed (fold, aggregate).
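For illustration, a minimal sketch of such an explicitly allowed case (assuming a running SparkContext sc): aggregate documents that its sequence and combine operators may modify and return their first argument to avoid object allocation.

import scala.collection.mutable.ArrayBuffer

val values = sc.parallelize(1 to 5)
val collected = values.aggregate(ArrayBuffer.empty[Int])(
  (buf, x) => { buf += x; buf }, // mutating the per-partition accumulator is allowed
  (b1, b2) => { b1 ++= b2; b1 }  // mutating the left accumulator when merging is allowed
)
collected.sorted // ArrayBuffer(1, 2, 3, 4, 5)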