(Why) do we need to call cache or persist on an RDD?

Tags:

scala

apache-spark

rdd

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data in memory? Or is the RDD data stored in memory, in a distributed way, by default?

val textFile = sc.textFile("/user/emp.txt")

As per my understanding, after the above step, textFile is an RDD and is available in the memory of all or some of the nodes.

If so, why do we need to call "cache" or "persist" on the textFile RDD?


asked Oct 05 '22 05:10

Ramana

1 Answer

Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:

val textFile = sc.textFile("/user/emp.txt")

It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.

RDD operations that require observing the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count: to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.

What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
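
The recomputation can be observed directly. Here is a minimal sketch, assuming Spark 2.0 or later is on the classpath and runs in local mode. The object name RecomputeDemo, the helper elementsComputed, and the in-memory sample data (used instead of /user/emp.txt so the sketch does not depend on a file) are illustrative assumptions, not part of the original code. An accumulator counts how many times an element is actually produced.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; all names here are illustrative only.
object RecomputeDemo {

  // Runs `count` the given number of times on an uncached RDD and returns
  // how many elements were computed in total across all runs.
  def elementsComputed(sc: SparkContext, counts: Int): Long = {
    val computed = sc.longAccumulator("computed")

    // Stand-in for sc.textFile(...): three "lines" held in memory.
    // map is a transformation, so this only describes the work to do.
    val lines = sc.parallelize(Seq("alice", "bob", "carol")).map { line =>
      computed.add(1L) // runs once per element, every time the RDD is evaluated
      line
    }

    // count is an action: each call evaluates the whole lineage again.
    (1 to counts).foreach(_ => lines.count())
    computed.sum
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("recompute-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // With no action, nothing is computed at all.
      println(elementsComputed(sc, counts = 0))
      // With two counts, every element is expected to be computed twice.
      println(elementsComputed(sc, counts = 2))
    } finally {
      sc.stop()
    }
  }
}
```

Accumulator updates made inside a transformation can be applied more than once if a task is retried, so treat the number as a diagnostic rather than an exact measurement.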

So what does RDD.cache do? If you add textFile.cache to the above code:

val textFile = sc.textFile("/user/emp.txt")
textFile.cache

It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
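
Both halves of this (cache itself is lazy, and the second action reuses the cached data) can be checked with a small sketch. It assumes Spark 2.0 or later in local mode; CacheDemo, run, and the sample data are hypothetical names chosen for illustration, and a small in-memory collection stands in for the file.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; all names here are illustrative only.
object CacheDemo {

  // Returns (elements computed right after cache(), elements computed after two counts).
  def run(sc: SparkContext): (Long, Long) = {
    val computed = sc.longAccumulator("computed")
    val lines = sc.parallelize(Seq("alice", "bob", "carol")).map { line =>
      computed.add(1L) // counts how often an element is actually produced
      line
    }

    // cache() only marks the RDD; no job runs here.
    lines.cache()
    assert(lines.getStorageLevel == StorageLevel.MEMORY_ONLY)
    val afterCache = computed.sum

    lines.count() // first action: computes the elements and caches them
    lines.count() // second action: served from the cache when it fits in memory
    val afterCounts = computed.sum

    lines.unpersist() // release the cached blocks once they are no longer needed
    (afterCache, afterCounts)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (afterCache, afterCounts) = run(sc)
      println(s"computed after cache(): $afterCache")
      println(s"computed after two counts: $afterCounts")
    } finally {
      sc.stop()
    }
  }
}
```

With data this small, the second number is expected to equal the number of elements rather than twice that, because only the first count evaluates the map.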

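The cache behavior depends on the available memory. If the file does not fit in memory, for example, then textFile.count will fall back to the usual behavior and re-read the file. More precisely, at the default storage level (MEMORY_ONLY), Spark caches the partitions that fit and recomputes the others each time they are needed.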
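
If re-reading is too expensive, one option is to pick a different storage level. cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); persist(StorageLevel.MEMORY_AND_DISK) writes partitions that do not fit in memory to disk instead of dropping them. The following is only a sketch under assumptions: PersistLevelDemo and countTwice are hypothetical names, and main assumes /user/emp.txt from the question actually exists on the configured file system.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; all names here are illustrative only.
object PersistLevelDemo {

  // Persists the RDD at the given level, runs two counts, then releases it.
  // The first count materializes the RDD; the second can reuse the stored partitions.
  def countTwice[T](rdd: RDD[T], level: StorageLevel): (Long, Long) = {
    rdd.persist(level) // lazy, like cache(): nothing is stored yet
    try {
      val first = rdd.count()
      val second = rdd.count()
      (first, second)
    } finally {
      rdd.unpersist() // free memory and disk blocks held for this RDD
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-level-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Path taken from the question; this fails if the file is missing.
      val textFile = sc.textFile("/user/emp.txt")
      println(countTwice(textFile, StorageLevel.MEMORY_AND_DISK))
    } finally {
      sc.stop()
    }
  }
}
```

Whether spilling to disk beats recomputing depends on how costly the lineage is; for a plain textFile read the difference may be small.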


answered Oct 21 '22 10:10

Daniel Darabos
