Say I have a dataframe:
rdd = sc.textFile(file)
df = sqlContext.createDataFrame(rdd)
df.cache()
and I add a column
df = df.withColumn('c1', lit(0))
I want to use df repeatedly. Do I need to re-cache() the dataframe, or does Spark do that automatically for me?
You will have to re-cache the dataframe every time you transform it, because each transformation returns a new dataframe. However, the entire dataframe does not have to be recomputed.
df = df.withColumn('c1', lit(0))
In the statement above a new dataframe is created and reassigned to the variable df. This new dataframe is not cached, but only the new column needs to be computed; the rest is retrieved from the cache of the original dataframe.