I want to understand more precisely how the cache method works on a DataFrame in PySpark. When I run df.cache(), it returns a DataFrame. So if I do df2 = df.cache(), which DataFrame is cached? Is it df, df2, or both?
I found the source code of DataFrame.cache:
def cache(self):
    """Persists the :class:`DataFrame` with the default storage level (`MEMORY_AND_DISK`).

    .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
    """
    self.is_cached = True
    self._jdf.cache()
    return self
Since cache() flags the DataFrame as cached, caches the underlying JVM DataFrame via self._jdf.cache(), and then returns self, df2 and df refer to the same object. Therefore, the answer is: both, because they are one and the same cached DataFrame.
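As a quick check, here is a minimal sketch (the local session and the spark.range example DataFrame are my own illustration, not from the question) that verifies df2 is the very same object and that both names report the cache state:

from pyspark.sql import SparkSession

# Illustrative setup: a local session and a small example DataFrame
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(10)

df2 = df.cache()  # cache() flags the DataFrame as cached and returns self

print(df2 is df)        # True: both names point to the same DataFrame object
print(df.is_cached)     # True
print(df2.is_cached)    # True
print(df.storageLevel)  # the default MEMORY_AND_DISK level

So caching does not create a second DataFrame; returning self is just a convenience that allows method chaining.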