 

Will there be any scenario where Spark RDDs fail to satisfy immutability?

Spark RDDs are constructed to be immutable, fault-tolerant, and resilient.

Do RDDs satisfy immutability in all scenarios? Or is there any case, be it in Streaming or Core, where an RDD might fail to satisfy immutability?

asked Sep 06 '15 by Srini

People also ask

Is Spark RDD immutable?

RDDs are immutable, which means you cannot alter the state of an RDD: you cannot add new records, delete records, or update records inside an RDD.
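
For illustration, a minimal sketch (assuming an active SparkContext sc, as in spark-shell): transformations only describe new RDDs, they never modify the one they are called on.

// Transformations return a new RDD; the original is never altered.
val base    = sc.parallelize(Seq(1, 2, 3))
val doubled = base.map(_ * 2)   // new RDD derived from `base`
base.collect()                  // Array(1, 2, 3) -- unchanged
doubled.collect()               // Array(2, 4, 6)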

What happens when worker node fails in Spark?

Failure of a worker node – A Spark worker node is a node that runs application code on the Spark cluster; these are the slave nodes. Any worker node running an executor can fail, resulting in the loss of its in-memory data. If any receivers were running on the failed nodes, their buffered data will be lost.

Why is RDD immutable in nature?

There are a few reasons for keeping RDDs immutable: 1) immutable data can be shared easily; 2) it can be recreated at any point in time; 3) immutable data can live in memory as easily as on disk.

Does Spark support fault tolerance?

Spark is by its nature very fault-tolerant. However, faults and application failures can and do happen in production at scale.


1 Answer

It depends on what you mean when you talk about an RDD. Strictly speaking, an RDD is just a description of lineage which exists only on the driver, and it doesn't provide any methods that can be used to mutate that lineage.
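
To make that concrete, here is a small sketch (assuming an active SparkContext sc, e.g. in spark-shell): an RDD is only a recipe, and toDebugString prints that lineage, while nothing in the API lets you rewrite it.

// An RDD on the driver is just a lineage description.
val lineage = sc.parallelize(1 to 4).map(_ + 1).filter(_ % 2 == 0)
println(lineage.toDebugString)
// prints something like:
// (n) MapPartitionsRDD[2] at filter ...
//  |  MapPartitionsRDD[1] at map ...
//  |  ParallelCollectionRDD[0] at parallelize ...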

When data is processed we can no longer talk about RDDs but about tasks; nevertheless, the data is exposed using immutable data structures (scala.collection.Iterator in Scala, itertools.chain in Python).
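
For example (a sketch, again assuming an active SparkContext sc), mapPartitions hands each task its slice of the data as a plain scala.collection.Iterator, which offers no way to modify the underlying collection:

// Each task sees its partition as an immutable, one-pass Iterator.
sc.parallelize(1 to 6, 2).mapPartitions(iter => Iterator.single(iter.sum)).collect()
// Array(6, 15) -- one sum per partition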

So far so good. Unfortunately, immutability of a data structure doesn't imply immutability of the stored data. Let's create a small example to illustrate that:

// Each element is a mutable Array(0); the map closure increments it in place.
val rdd = sc.parallelize(Array(0) :: Array(0) :: Array(0) :: Nil)
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0

You can execute this as many times as you want and get the same result. Now let's cache rdd and repeat the whole process:

rdd.cache
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 6.0
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 9.0

Since the function we use in the first map is not pure and modifies its mutable argument in place, these changes accumulate with each execution and result in unpredictable output. For example, if rdd is evicted from the cache we can once again get 3.0. If only some partitions are cached, you can get mixed results.
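
A safer pattern, sketched here, is to derive a new value instead of mutating the cached array; with that, repeated runs give the same result whether rdd is cached or not:

// Build a new value instead of mutating the cached Array in place.
rdd.map(a => a.head + 1).sum
// Double = 3.0 on every run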

PySpark provides stronger isolation, and obtaining a result like this is not possible there, but that is a matter of architecture, not of immutability.

The takeaway message here is that you should be extremely careful when working with mutable data and avoid any in-place modifications unless it is explicitly allowed (fold, aggregate).
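
As a minimal sketch of the "explicitly allowed" case (assuming an active SparkContext sc): aggregate's seqOp and combOp may modify and return their first argument, because each partition works on its own fresh copy of the zero value, so nothing mutable is shared between tasks.

import scala.collection.mutable.ArrayBuffer

// Mutating the accumulator in place is fine here: Spark deserializes a
// fresh copy of the zero value for every partition.
val nums = sc.parallelize(1 to 10, 2)
val buffer = nums.aggregate(ArrayBuffer.empty[Int])(
  (buf, x) => { buf += x; buf },   // seqOp: add to the per-partition buffer in place
  (b1, b2) => { b1 ++= b2; b1 }    // combOp: merge buffers in place
)
// buffer: ArrayBuffer containing all ten elements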

answered Sep 18 '22 by zero323