PYSPARK persist is a data optimization model that is used to store the data in-memory model. It is a time and cost-efficient model that saves up a lot of execution time and cuts up the cost of the data processing. It stores the data that is stored at a different storage level the levels being MEMORY and DISK.
Both caching and persisting are used to save the Spark RDD, Dataframe, and Dataset's. But, the difference is, RDD cache() method default saves it to memory (MEMORY_ONLY) whereas persist() method is used to store it to the user-defined storage level.
Broadcast ProcessSpark reads the data from each partition in the same way it did it during Persist. But it is going to store the data in the executor in the working memory and it is going to take the same amount of space (4 GB per executor).
We can persist the RDD in memory and use it efficiently across parallel operations. The difference between cache() and persist() is that using cache() the default storage level is MEMORY_ONLY while using persist() we can use various storage levels (described below). It is a key tool for an interactive algorithm.
With cache()
, you use only the default storage level :
MEMORY_ONLY
for RDD
MEMORY_AND_DISK
for Dataset
With persist()
, you can specify which storage level you want for both RDD and Dataset.
From the official docs:
- You can mark an
RDD
to be persisted using thepersist
() orcache
() methods on it.- each persisted
RDD
can be stored using a differentstorage level
- The
cache
() method is a shorthand for using the default storage level, which isStorageLevel.MEMORY_ONLY
(store deserialized objects in memory).
Use persist()
if you want to assign a storage level other than :
MEMORY_ONLY
to the RDD
MEMORY_AND_DISK
for Dataset
Interesting link for the official documentation : which storage level to choose
The difference between
cache
andpersist
operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY
), i.e.cache
is merelypersist
with the default storage levelMEMORY_ONLY
But
Persist()
We can save the intermediate results in 5 storage levels.
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER
- MEMORY_AND_DISK_SER
- DISK_ONLY
/** * Persist this RDD with the default storage level (
MEMORY_ONLY
). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)/** * Persist this RDD with the default storage level (
MEMORY_ONLY
). */
def cache(): this.type = persist()
see more details here...
Caching or persistence are optimization techniques for (iterative and interactive) Spark computations. They help saving interim partial results so they can be reused in subsequent stages. These interim results as RDD
s are thus kept in memory (default) or more solid storage like disk and/or replicated.
RDD
s can be cached using cache
operation. They can also be persisted using persist
operation.
#
persist
,cache
These functions can be used to adjust the storage level of a
RDD
. When freeing up memory, Spark will use the storage level identifier to decide which partitions should be kept. The parameter less variantspersist
() andcache
() are just abbreviations forpersist(StorageLevel.MEMORY_ONLY).
Warning: Once the storage level has been changed, it cannot be changed again!
Just because you can cache a RDD
in memory doesn’t mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in doing so, recomputation can be faster than the price paid by the increased memory pressure.
It should go without saying that if you only read a dataset once there is no point in caching it, it will actually make your job slower. The size of cached datasets can be seen from the Spark Shell..
Listing Variants...
def cache(): RDD[T]
def persist(): RDD[T]
def persist(newLevel: StorageLevel): RDD[T]
See below example :
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)
Note :
Due to the very small and purely syntactic difference between caching and persistence of RDD
s the two terms are often used interchangeably.
See more visually here....
Persist in memory and disk:
Caching can improve the performance of your application to a great extent.
There is no difference. From RDD.scala
.
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
Spark gives 5 types of Storage level
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
cache()
will use MEMORY_ONLY
. If you want to use something else, use persist(StorageLevel.<*type*>)
.
By default persist()
will
store the data in the JVM heap as unserialized objects.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With