 

How can I check whether my RDD or dataframe is cached or not?

I have created a DataFrame, say df1, and cached it using df1.cache(). How can I check whether it has actually been cached? Also, is there a way to see all my cached RDDs and DataFrames?

Asked Sep 07 '15 by Arnab

People also ask

How do you check if a DataFrame is cached or not?

You can check storageLevel.useMemory on a DataFrame, or getStorageLevel.useMemory on an RDD, to find out whether the dataset is in memory.

What is caching in RDD?

Both caching and persisting are used to save Spark RDDs, DataFrames, and Datasets. The difference is that cache() saves to the default storage level (MEMORY_ONLY for RDDs), whereas persist() lets you store at a user-defined storage level.

What is cache () default storage level for RDD?

You can mark an RDD to be persisted using the persist() or cache() methods on it. Each persisted RDD can be stored using a different storage level. The cache() method is shorthand for the default storage level, StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

What does cache () do in Pyspark?

cache() is a lazy Spark operation that can be used on a DataFrame, Dataset, or RDD when you want to perform more than one action on it. cache() marks the specified DataFrame, Dataset, or RDD to be kept in the memory of your cluster's workers; the data is materialized when the first action runs.
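Because cache() is lazy, its effect only becomes visible once an action runs. A minimal PySpark sketch (assuming an active SparkSession bound to the name `spark`; note that for DataFrames the default storage level is MEMORY_AND_DISK in Spark 2.x+, not MEMORY_ONLY):

```python
# Sketch (PySpark), assuming an active SparkSession named `spark`.
df = spark.range(100)
df.cache()           # lazy: marks df for caching, nothing is stored yet
df.count()           # the first action actually materializes the cache
print(df.is_cached)  # True (df is marked as cached; data fills in on the first action)
```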


2 Answers

You can call storageLevel.useMemory on a DataFrame, or getStorageLevel.useMemory on an RDD, to find out if the dataset is in memory.

For the Dataframe do this:

scala> val df = Seq(1, 2).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.storageLevel.useMemory
res0: Boolean = false

scala> df.cache()
res1: df.type = [value: int]

scala> df.storageLevel.useMemory
res2: Boolean = true

For the RDD do this:

scala> val rdd = sc.parallelize(Seq(1,2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd.getStorageLevel.useMemory
res9: Boolean = false

scala> rdd.cache()
res10: rdd.type = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd.getStorageLevel.useMemory
res11: Boolean = true
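The question also asks how to see everything that is cached. In Scala, SparkContext.getPersistentRDDs returns a map of all RDDs the context has marked as persistent. A sketch for spark-shell (this covers RDDs only; DataFrames cached via df.cache() are tracked separately by Spark SQL's CacheManager and may not appear here):

```scala
// Sketch (spark-shell): list every RDD currently marked persistent.
// getPersistentRDDs returns Map[Int, RDD[_]], keyed by RDD id.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id: storage level = ${rdd.getStorageLevel.description}")
}
```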
Answered Oct 16 '22 by Patrick McGloin

@Arnab,

Did you find the function in Python? Here is an example for a DataFrame DF:

DF.cache()
print(DF.is_cached)

Hope this helps.
Ram

Answered Oct 16 '22 by user6296218