
How to check if Spark RDD is in memory?

I have an instance of org.apache.spark.rdd.RDD[MyClass]. How can I programmatically check whether the instance is persisted in memory?

Dmitry Petrov asked Jun 06 '15 22:06




2 Answers

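You want RDD.getStorageLevel. It returns StorageLevel.NONE if the RDD has not been marked for persistence. However, that only tells you whether the RDD is marked for caching, not whether any of it has actually been cached. If you want the actual status, you can use the developer API sc.getRDDStorageInfo, or sc.getPersistentRDDs, which lists the RDDs that have been marked as persistent.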
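The difference between "marked for caching" and "actually cached" can be sketched roughly as below. This is a minimal illustration, not a definitive implementation: the object name RddCacheCheck and its helper methods are hypothetical, and it assumes a local Spark setup with spark-core on the classpath. Depending on the Spark version, the figures reported by sc.getRDDStorageInfo may lag briefly behind the action that fills the cache.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object RddCacheCheck {
  // True if the RDD has been marked for persistence at a level that uses memory.
  // This says nothing about whether any partition has been computed yet.
  def isMarkedForMemory(rdd: RDD[_]): Boolean =
    rdd.getStorageLevel.useMemory

  // True if the SparkContext has the RDD registered as persistent.
  def isRegisteredPersistent(sc: SparkContext, rdd: RDD[_]): Boolean =
    sc.getPersistentRDDs.contains(rdd.id)

  // True if Spark reports at least one cached partition of this RDD
  // occupying memory (uses the developer API getRDDStorageInfo).
  def hasPartitionsInMemory(sc: SparkContext, rdd: RDD[_]): Boolean =
    sc.getRDDStorageInfo.exists { info =>
      info.id == rdd.id && info.numCachedPartitions > 0 && info.memSize > 0
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-cache-check")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000).map(_ * 2)
      println(s"marked before cache(): ${isMarkedForMemory(rdd)}")

      rdd.cache() // only marks the RDD; nothing is computed yet
      println(s"marked after cache():  ${isMarkedForMemory(rdd)}")
      println(s"registered persistent: ${isRegisteredPersistent(sc, rdd)}")

      rdd.count() // an action is needed before partitions are stored
      println(s"partitions in memory:  ${hasPartitionsInMemory(sc, rdd)}")
    } finally {
      sc.stop()
    }
  }
}
```

The first two helpers answer "was caching requested?", while the third asks Spark what is currently stored, which is usually the more useful check after an action has run.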

Justin Pihony answered Nov 10 '22 13:11



You can call rdd.getStorageLevel.useMemory to check whether the RDD is marked to be stored in memory, as follows:

scala> myrdd.getStorageLevel.useMemory
res3: Boolean = false

scala> myrdd.cache()
res4: myrdd.type = MapPartitionsRDD[2] at filter at <console>:29

scala> myrdd.getStorageLevel.useMemory
res5: Boolean = true
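Note that useMemory is only one of the flags on a storage level: an RDD persisted with StorageLevel.DISK_ONLY is persisted, yet useMemory is false. A small sketch of reading the flags together follows; the object StorageLevelFlags and its describe method are hypothetical names used for illustration, and the code assumes spark-core is on the classpath.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, for illustration only.
object StorageLevelFlags {
  // Summarise where a given storage level asks Spark to keep the data.
  def describe(level: StorageLevel): String =
    (level.useMemory, level.useDisk) match {
      case (true, true)   => "memory and disk"
      case (true, false)  => "memory only"
      case (false, true)  => "disk only"
      case (false, false) => "not persisted"
    }
}
```

For example, StorageLevelFlags.describe(myrdd.getStorageLevel) could be used in place of the bare useMemory check when the RDD might have been persisted at a level other than the default used by cache().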
KayV answered Nov 10 '22 12:11
