PySpark: do I need to re-cache a DataFrame?

Say I have a DataFrame:

from pyspark.sql import Row

rdd = sc.textFile(file).map(lambda line: Row(value=line))
df = sqlContext.createDataFrame(rdd)
df.cache()

and I add a column

from pyspark.sql.functions import lit

df = df.withColumn('c1', lit(0))

I want to use df repeatedly. So do I need to call cache() on it again, or does Spark do that for me automatically?

asked Feb 05 '17 by PSNR
1 Answer

You will have to re-cache the DataFrame every time you manipulate or change it. However, the entire DataFrame does not have to be recomputed.

df = df.withColumn('c1', lit(0))

In the statement above, a new DataFrame is created and reassigned to the variable df. This new DataFrame is not cached itself, but because it is derived from the cached one, only the new column has to be computed; the rest is retrieved from the cache.
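
A minimal sketch of this behaviour (it uses a SparkSession rather than the question's sqlContext, and the toy data, the column name c1, and the is_cached checks are only for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("recache-demo").getOrCreate()

# Build and cache the original DataFrame, then materialize the cache.
df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
df.cache()
df.count()

# withColumn returns a new DataFrame; it is not cached automatically.
df = df.withColumn("c1", lit(0))
print(df.is_cached)   # False -- the derived DataFrame is uncached

# Re-cache if you plan to reuse df. Only the new column is computed;
# the rest of the data comes from the cached parent.
df.cache()
df.count()
print(df.is_cached)   # True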

answered Nov 10 '22 by rogue-one