I have an instance of org.apache.spark.rdd.RDD[MyClass]. How can I programmatically check whether the instance is persisted in memory?
Two of the main features of a Spark RDD are in-memory computation (data resides in memory for faster access and fewer I/O operations) and fault tolerance. Caching can speed up subsequent computations considerably: when the RDD is computed for the first time, it is kept in memory on the node. Spark's cache is fault tolerant, so whenever any partition of a cached RDD is lost, it can be recovered by re-running the transformations that originally created it.
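A minimal sketch of this behavior, assuming a live SparkContext named sc and a hypothetical input path:

val lines  = sc.textFile("hdfs:///logs/app.log")   // hypothetical path
val errors = lines.filter(_.contains("ERROR"))
errors.cache()   // mark for in-memory storage; nothing is computed yet
errors.count()   // first action: computes the partitions and caches them
errors.count()   // served from the cached partitions, no re-read of the file
// If an executor is lost, the missing cached partitions are recomputed
// automatically from the lineage (textFile -> filter).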
isEmpty returns true if and only if the RDD contains no elements at all. Note that an RDD may be empty even when it has at least one partition.
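For example, in the Spark shell (assuming a live SparkContext sc):

scala> sc.parallelize(Seq.empty[Int], 4).isEmpty
res0: Boolean = true

scala> sc.parallelize(Seq(1, 2, 3), 4).isEmpty
res1: Boolean = false

The first RDD has four partitions but no elements, so isEmpty still returns true.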
Yes, the data of all 10 RDDs will be spread across the Spark workers' RAM, but not every machine will necessarily hold a partition of each RDD. And of course an RDD will only have data in memory once some action has been performed on it, since RDDs are lazily evaluated.
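A small sketch of that lazy behavior, assuming a live SparkContext sc and no other cached RDDs in the session:

val r = sc.parallelize(1 to 1000000).map(_ * 2)
r.cache()                     // only marks the RDD; no job runs, nothing in RAM yet
sc.getRDDStorageInfo.length   // 0: no partitions have been materialized so far
r.count()                     // the action triggers computation and fills the cache
sc.getRDDStorageInfo.length   // 1: r's blocks are now reported as cached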
You want RDD.getStorageLevel. It will return StorageLevel.NONE if the RDD is not marked for persistence. However, that only tells you whether it is marked for caching, not whether it has actually been cached. If you want the actual status, you can use the developer API sc.getRDDStorageInfo or sc.getPersistentRDDs.
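A hedged sketch of the difference between "marked for caching" and "actually cached", assuming a live SparkContext sc:

import org.apache.spark.storage.StorageLevel

val r = sc.parallelize(1 to 100).cache()
r.getStorageLevel != StorageLevel.NONE      // true: marked for caching
sc.getRDDStorageInfo.exists(_.id == r.id)   // false: nothing materialized yet
r.count()                                   // action forces computation
sc.getRDDStorageInfo.exists(_.id == r.id)   // true: blocks are now in memory
sc.getPersistentRDDs.contains(r.id)         // true: registered as a persistent RDD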
You can check rdd.getStorageLevel.useMemory to see whether the RDD is marked for in-memory storage, as follows:
scala> myrdd.getStorageLevel.useMemory
res3: Boolean = false
scala> myrdd.cache()
res4: myrdd.type = MapPartitionsRDD[2] at filter at <console>:29
scala> myrdd.getStorageLevel.useMemory
res5: Boolean = true
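Calling unpersist() resets the storage level again (continuing the same session; the exact REPL output may vary by Spark version):

scala> myrdd.unpersist()
res6: myrdd.type = MapPartitionsRDD[2] at filter at <console>:29

scala> myrdd.getStorageLevel.useMemory
res7: Boolean = false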