I am using persist() with different storage levels, but I found no performance difference between MEMORY_ONLY and DISK_ONLY.
I think there might be something wrong with my code... Where can I find the persisted RDDs on disk so that I can make sure they were actually persisted?
RDDs are the main logical data units in Spark. They are distributed collections of objects, stored in memory or on disk across the machines of a cluster. A single RDD is divided into multiple logical partitions, so that those partitions can be stored and processed on different machines of the cluster.
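For instance (a minimal sketch assuming a local Spark setup; the app name is illustrative), you can control the number of partitions when creating an RDD:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: build an RDD split into 4 partitions, so its
// pieces can be stored and processed on different executors.
val conf = new SparkConf().setAppName("rdd-partitions").setMaster("local[*]")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 1000, numSlices = 4)
println(rdd.getNumPartitions) // 4
```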
With the DISK_ONLY storage level, the RDD is stored only on disk: the space used for storage is low, but the CPU/computation time is high, since every access has to read the partitions back from disk.
RDDs keep data in memory for fast access during computation and provide fault tolerance [110]. An RDD is an immutable distributed collection of objects, stored across the nodes of the cluster, that can be operated on in parallel.
To persist an RDD, we call its persist() method. Spark can be used from Scala, Python, Java, and other languages. Called with no argument, persist() uses the default MEMORY_ONLY level, which, when working with Java and Scala, stores the data in the JVM as deserialized objects.
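As a sketch (reusing the sc from above), here is how the two levels from the question differ in code, and one way to check that persistence actually happened:

```scala
import org.apache.spark.storage.StorageLevel

// persist() with no argument is equivalent to persist(StorageLevel.MEMORY_ONLY)
// for RDDs: deserialized Java objects in the JVM heap.
val inMemory = sc.parallelize(1 to 1000000).map(_ * 2).persist(StorageLevel.MEMORY_ONLY)
val onDisk   = sc.parallelize(1 to 1000000).map(_ * 2).persist(StorageLevel.DISK_ONLY)

// Persistence is lazy: nothing is cached until an action runs.
inMemory.count() // materializes the in-memory cache
onDisk.count()   // writes block files under spark.local.dir

// The Storage tab of the Spark UI (http://localhost:4040 while the app runs)
// lists each persisted RDD and how much of it sits in memory vs. on disk.
```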
As per the docs:

spark.local.dir (default: /tmp)

Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
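So the block files for a DISK_ONLY RDD end up under spark.local.dir (or whatever directory the cluster manager substitutes for it). A sketch for checking this locally, assuming a local[*] run; the scratch path is illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Point the scratch space at a directory we can inspect afterwards.
val conf = new SparkConf()
  .setAppName("inspect-disk-store")
  .setMaster("local[*]")
  .set("spark.local.dir", "/tmp/spark-scratch") // illustrative path
val sc = new SparkContext(conf)

sc.parallelize(1 to 1000000).persist(StorageLevel.DISK_ONLY).count()

// While the application is alive, the persisted partitions appear as
// files named after their block IDs, roughly:
//   /tmp/spark-scratch/blockmgr-<uuid>/<subdir>/rdd_<rddId>_<partition>
// e.g. `find /tmp/spark-scratch -name 'rdd_*'` should list them.
// They are cleaned up when the SparkContext shuts down.
```

If those files never show up, the usual culprit is that no action was ever run on the persisted RDD, so nothing was materialized in the first place.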