What is the difference between cache and persist?

Warning: cache judiciously. See "(Why) do we need to call cache or persist on a RDD".

Just because you can cache a RDD in memory doesn’t mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in doing so, recomputation can be faster than the price paid by the increased memory pressure.

It should go without saying that if you only read a dataset once there is no point in caching it; doing so will actually make your job slower. The size of cached datasets can be seen from the Spark Shell.
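As a minimal sketch of the case where caching does pay off, the following reuses one RDD across two actions. The object name, the app name and the local[2] master are assumptions made for a standalone run; the word list is borrowed from the example further down and stands in for an expensive computation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheReuseSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, so the sketch runs without a cluster.
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-reuse-sketch")
    val sc = new SparkContext(conf)

    // Stand-in for an expensive transformation chain.
    val words = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
    val lengths = words.map(w => (w, w.length))

    // Worth caching only because two actions below reuse `lengths`.
    lengths.cache()

    // The first action computes the partitions and stores them.
    val total = lengths.count()
    // The second action can read the stored partitions instead of recomputing the map.
    val distinctWords = lengths.keys.distinct().count()

    println(s"total=$total distinct=$distinctWords")

    // Release the cached blocks once the RDD is no longer needed.
    lengths.unpersist()
    sc.stop()
  }
}
```

If only one of the two actions were present, the cache() call would add memory pressure without saving any work.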

Listing the variants:

def cache(): RDD[T]
def persist(): RDD[T]
def persist(newLevel: StorageLevel): RDD[T]

See the example below:

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)


Note: Because the difference between caching and persisting RDDs is small and purely syntactic, the two terms are often used interchangeably.

Persist in memory and disk:

[Figure: persisting an RDD in memory and on disk]

Cache

Caching can improve the performance of your application to a great extent.


There is no difference in behavior: cache() simply calls persist() with the default storage level. From RDD.scala:

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
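A small check of that equivalence could look like the sketch below. The object name, app name and local[2] master are assumptions for a standalone run; only cache(), persist() and getStorageLevel come from the text above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheEqualsPersistSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local master, so no cluster is required.
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-equals-persist-sketch")
    val sc = new SparkContext(conf)

    val a = sc.parallelize(1 to 10).cache()   // delegates to persist()
    val b = sc.parallelize(1 to 10).persist() // delegates to persist(StorageLevel.MEMORY_ONLY)

    // Given the RDD.scala source quoted above, both RDDs should report the same level.
    println(a.getStorageLevel == b.getStorageLevel)
    println(a.getStorageLevel == StorageLevel.MEMORY_ONLY)

    sc.stop()
  }
}
```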

Spark provides several storage levels, including:

MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY

For an RDD, cache() always uses MEMORY_ONLY. If you want to use something else, use persist(StorageLevel.<type>).

By default, persist() stores the data in the JVM heap as deserialized objects.
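To pick a level other than the default, something along these lines should work. This is a sketch under assumptions: the object name, app name and local[2] master are illustrative, and the unpersist-then-persist step relies on Spark refusing to change the level of an RDD that already has one assigned.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistLevelSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local master, so no cluster is required.
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-level-sketch")
    val sc = new SparkContext(conf)

    val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)

    // Explicit level: keep partitions in memory and spill to disk what does not fit.
    c.persist(StorageLevel.MEMORY_AND_DISK)
    println(c.getStorageLevel == StorageLevel.MEMORY_AND_DISK)

    // A level cannot be changed while one is assigned, so drop it first.
    c.unpersist()
    c.persist(StorageLevel.DISK_ONLY)
    println(c.getStorageLevel.description)

    sc.stop()
  }
}
```

Note that persist() only marks the RDD; nothing is stored until an action actually computes it.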
