
Cache a DataFrame in PySpark

I want to understand more precisely how the cache method works for a DataFrame in PySpark.

When I run df.cache(), it returns a DataFrame. So if I do df2 = df.cache(), which DataFrame is cached? Is it df, df2, or both?

Steven asked Dec 04 '17

1 Answer

I found the source code of DataFrame.cache:

def cache(self):
    """Persists the :class:`DataFrame` with the default storage level (`MEMORY_AND_DISK`).

    .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
    """
    self.is_cached = True
    self._jdf.cache()
    return self

Therefore, the answer is: both. Since cache() returns self, df2 = df.cache() makes df2 refer to the very same object as df. There is only one DataFrame, and it is cached; both names point to it.
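The "set a flag, return self" pattern above can be illustrated without a Spark cluster. Below is a minimal stand-in class (hypothetical, not the real PySpark API) whose cache method mirrors the source code quoted above, showing why both names end up bound to the same cached object:

```python
# Hypothetical stand-in mimicking DataFrame.cache() from the quoted source.
class FakeDataFrame:
    def __init__(self):
        self.is_cached = False

    def cache(self):
        # Same shape as the real implementation: set the flag, return self.
        self.is_cached = True
        return self

df = FakeDataFrame()
df2 = df.cache()

print(df2 is df)     # True: cache() returned the same object, not a copy
print(df.is_cached)  # True: the one underlying object is marked cached
```

Because cache() mutates the receiver and returns it, assigning the result to a new name is purely cosmetic; df and df2 are two references to one cached DataFrame.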

Steven answered Nov 03 '22