Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

If dataframes in Spark are immutable, why are we able to modify it with operations such as withColumn()?

This is probably a stupid question originating from my ignorance. I have been working on PySpark for a few weeks now and do not have much programming experience to start with.

My understanding is that in Spark, RDDs, Dataframes, and Datasets are all immutable - which, again I understand, means you cannot change the data. If so, why are we able to edit a Dataframe's existing column using withColumn()?

like image 356
pallupz Avatar asked Nov 19 '18 11:11

pallupz


People also ask

Why is DataFrame immutable in Spark?

As per Spark Architecture DataFrame is built on top of RDDs which are immutable in nature, Hence Data frames are immutable in nature as well.

What is the use of withColumn in Spark?

In Spark SQL, the withColumn() function is the most popular one, which is used to derive a column from multiple columns, change the current value of a column, convert the datatype of an existing column, create a new column, and many more.

Are Spark Datasets immutable?

While Pyspark derives its basic data types from Python, its own data structures are limited to RDD, Dataframes, Graphframes. These data frames are immutable and offer reduced flexibility during row/column level handling, as compared to Python.

Why is RDD immutable?

RDDs are immutable (it means that you cannot alter the state of RDD (i.e you cannot add new records or delete records or update records inside RDD.) ) .....


1 Answers

As per Spark Architecture DataFrame is built on top of RDDs which are immutable in nature, Hence Data frames are immutable in nature as well.

Regarding the withColumn or any other operation for that matter, when you apply such operations on DataFrames it will generate a new data frame instead of updating the existing data frame.

However, When you are working with python which is dynamically typed language you overwrite the value of the previous reference. Hence when you are executing below statement

df = df.withColumn()

It will generate another dataframe and assign it to reference "df".

In order to verify the same, you can use id() method of rdd to get the unique identifier of your dataframe.

df.rdd.id()

will give you unique identifier for your dataframe.

I hope the above explanation helps.

Regards,

Neeraj

like image 175
Neeraj Bhadani Avatar answered Sep 19 '22 18:09

Neeraj Bhadani