When are cache and persist executed (since they don't seem like actions)?

I am implementing a Spark application, of which below is a sample snippet (not the exact code):

val rdd1 = sc.textFile(HDFS_PATH)
val rdd2 = rdd1.map(func)
rdd2.persist(StorageLevel.MEMORY_AND_DISK)
println(rdd2.count)

On checking the performance of this code from the Spark Application Master UI, I see an entry for the count action, but not for the persist. The DAG for this count action also has a node for the 'map' transformation (line 2 of the above code).

Is it safe to conclude that the map transformation is executed when count (in the last line) is encountered, and not when persist is encountered?

Also, at what point is rdd2 actually persisted? I understand that only two types of operations can be called on RDDs - transformations and actions. If the RDD is persisted lazily when the count action is called, would persist be considered a transformation or an action or neither?

Asked May 16 '17 by Ankit Khettry

People also ask

What is persist and cache?

The only difference between cache() and persist() is that cache() saves the intermediate results in memory only, while persist() lets us save them at any of the five storage levels (MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY).

What is the difference between cache () and persist ()?

The difference between cache() and persist() is that with cache() the default storage level is MEMORY_ONLY, while with persist() we can use the various storage levels described above. It is a key tool for iterative algorithms and interactive use.
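For example (a minimal sketch; nums and doubled are just illustrative RDDs):

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 100)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
nums.cache()

// persist() accepts an explicit storage level; note that a storage
// level can only be assigned to an RDD once, so use one or the other
val doubled = nums.map(_ * 2)
doubled.persist(StorageLevel.MEMORY_AND_DISK)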

When should I use persist in Spark?

Spark RDD persistence is an optimization technique that saves the result of an RDD evaluation in memory. By saving the intermediate result we can reuse it later if required, which reduces the computation overhead.
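As a sketch of that reuse (HDFS_PATH and func as in the question's snippet):

import org.apache.spark.storage.StorageLevel

val rdd2 = sc.textFile(HDFS_PATH).map(func)
rdd2.persist(StorageLevel.MEMORY_AND_DISK)

rdd2.count()    // first action: runs the map and caches the result
rdd2.take(10)   // second action: reads the cached partitions, no recomputation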

Is persist an action or transformation?

Neither. cache() and persist() are just functions on an RDD that mark it to be cached or persisted. The first time the RDD is evaluated as a consequence of an action, it is actually persisted/cached. So cache() and persist() are neither actions nor transformations.
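You can observe this laziness from the driver. A minimal sketch, assuming a SparkContext sc (getRDDStorageInfo is a DeveloperApi that lists RDDs with cached blocks):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.persist(StorageLevel.MEMORY_AND_DISK)

sc.getRDDStorageInfo.length   // 0 -- nothing has been materialized yet
rdd.count()                   // the first action triggers computation and caching
sc.getRDDStorageInfo.length   // 1 -- the RDD now has cached partitions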


1 Answer

Dataset's cache and persist operators are lazy and don't have any effect until you call an action (and you must wait until the caching has finished, which is the extra price to pay for better performance later on).

From the RDD Persistence section of Spark's official documentation:

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

That's exactly the reason why some people (and Spark SQL itself!) do the following trick:

rdd2.persist(StorageLevel.MEMORY_AND_DISK).count

to trigger the caching.

The count operator is fairly cheap, so the net effect is that the caching is executed almost immediately after that line (there may be a small delay before the caching completes, as it executes asynchronously).

The benefits of this count after persist are as follows:

  1. No action (other than the count itself) will "suffer" the extra time for caching

  2. The time between this line and the place where the cached rdd2 is used may be enough to complete the caching, so that time is spent more productively (with no extra "slowdown" for caching); see the sketch after this list
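Putting the pattern together, continuing the example above (a sketch; expensiveFunc and combine are hypothetical placeholders for your real workload):

rdd2.persist(StorageLevel.MEMORY_AND_DISK)
rdd2.count()   // cheap action: pays the one-off caching cost up front

// ... other driver-side work happens here, giving the asynchronous
// caching time to finish ...

val result = rdd2.map(expensiveFunc).reduce(combine)   // reads cached partitions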

So when you asked:

would persist be considered a transformation or an action or neither?

I'd say it's neither; consider it an optimization hint (one that may or may not ever be executed or taken into account).
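In that spirit, the hint is recorded immediately even though nothing has been cached yet, as RDD.getStorageLevel shows (a small sketch, continuing the example above):

rdd2.persist(StorageLevel.MEMORY_AND_DISK)
rdd2.getStorageLevel   // StorageLevel(disk, memory, deserialized, 1 replicas)
                       // the hint is set, but no partitions are cached
                       // until an action runs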


Use the web UI's Storage tab to see which Datasets (via their underlying RDDs) have already been persisted.


You can also see the effect of the cache and persist operators using explain (or simply QueryExecution.optimizedPlan).

val q1 = spark.range(10).groupBy('id % 5).agg(count("*") as "count").cache
scala> q1.explain
== Physical Plan ==
*(1) ColumnarToRow
+- InMemoryTableScan [(id % 5)#120L, count#119L]
      +- InMemoryRelation [(id % 5)#120L, count#119L], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(2) HashAggregate(keys=[(id#0L % 5)#8L], functions=[count(1)])
               +- Exchange hashpartitioning((id#0L % 5)#8L, 200), true, [id=#13]
                  +- *(1) HashAggregate(keys=[(id#0L % 5) AS (id#0L % 5)#8L], functions=[partial_count(1)])
                     +- *(1) Range (0, 10, step=1, splits=16)

scala> println(q1.queryExecution.optimizedPlan.numberedTreeString)
00 InMemoryRelation [(id % 5)#5L, count#4L], StorageLevel(disk, memory, deserialized, 1 replicas)
01    +- *(2) HashAggregate(keys=[(id#0L % 5)#8L], functions=[count(1)], output=[(id % 5)#5L, count#4L])
02       +- Exchange hashpartitioning((id#0L % 5)#8L, 200), true, [id=#13]
03          +- *(1) HashAggregate(keys=[(id#0L % 5) AS (id#0L % 5)#8L], functions=[partial_count(1)], output=[(id#0L % 5)#8L, count#10L])
04             +- *(1) Range (0, 10, step=1, splits=16)

Please note that the count above is a standard aggregation function, not an action, so no caching happens. It is just a coincidence that count is both the name of a standard function and of a Dataset action.
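To see the distinction (a sketch): count("*") inside agg is the standard aggregation function and stays lazy, while Dataset.count() is the action that actually runs a job (and, for a cached Dataset, triggers the caching):

import org.apache.spark.sql.functions.count
import spark.implicits._   // for the 'id syntax (auto-imported in spark-shell)

val grouped = spark.range(10).groupBy('id % 5).agg(count("*") as "count")
// nothing has executed yet -- count("*") is merely part of the lazy plan

grouped.count()   // Dataset.count() is the action: this triggers a job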

You can also cache a table using pure SQL (this is eager!):

// This registers range5 to hold the output of the range(5) function
spark.sql("CACHE TABLE range5 AS SELECT * FROM range(5)")
val q2 = spark.sql("SELECT * FROM range5")
scala> q2.explain
== Physical Plan ==
*(1) ColumnarToRow
+- Scan In-memory table `range5` [id#51L]
      +- InMemoryRelation [id#51L], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) Range (0, 5, step=1, splits=16)

The InMemoryTableScan physical operator (with the InMemoryRelation logical plan) is how you can make sure that the query is cached in memory and hence reused.
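If you would rather check programmatically than read plans, a sketch under the assumption that finding an InMemoryRelation in the optimized plan is sufficient for your case:

import org.apache.spark.sql.execution.columnar.InMemoryRelation

val usesCache = q2.queryExecution.optimizedPlan.find {
  case _: InMemoryRelation => true
  case _ => false
}.isDefined

// For tables registered in the catalog there is also a public API
spark.catalog.isCached("range5")   // true after the CACHE TABLE above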


Moreover, Spark SQL itself uses the same pattern to trigger DataFrame caching for SQL's CACHE TABLE query (which, unlike RDD caching, is eager by default):

if (!isLazy) {
  // Performs eager caching
  sparkSession.table(tableIdent).count()
}

That means that, depending on the operator, you may get different results as far as caching is concerned: the cache and persist operators are lazy by default, while SQL's CACHE TABLE is eager.
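SQL gives you the lazy behaviour too if you ask for it, using the LAZY keyword (a sketch; range5_eager and range5_lazy are just illustrative names):

// Eager (the default): materializes the cache immediately
spark.sql("CACHE TABLE range5_eager AS SELECT * FROM range(5)")

// Lazy: behaves like Dataset.cache -- materialized on first use
spark.sql("CACHE LAZY TABLE range5_lazy AS SELECT * FROM range(5)")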

Answered Oct 11 '22 by Jacek Laskowski