Say I have a dataframe:
rdd = sc.textFile(file)
df = sqlContext.createDataFrame(rdd)
df.cache()
and I add a column
df = df.withColumn('c1', lit(0))
I want to use df repeatedly. Do I need to re-cache() the dataframe, or does Spark do that automatically for me?
You will have to re-cache the dataframe every time you transform it, because each transformation returns a new dataframe. However, the entire dataframe does not have to be recomputed.
df = df.withColumn('c1', lit(0))
In the statement above a new dataframe is created and reassigned to the variable df. This new dataframe is not cached, but only the new column needs to be computed; the rest is retrieved from the cache of the original dataframe.