
(Why) do we need to call cache or persist on an RDD?

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the memory by default?

val textFile = sc.textFile("/user/emp.txt")

As I understand it, after the above step textFile is an RDD and is available in the memory of all or some of the nodes.

If so, why do we need to call "cache" or "persist" on the textFile RDD?

Ramana asked Oct 05 '22 05:10





1 Answer

Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:

val textFile = sc.textFile("/user/emp.txt")

It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.

RDD operations that require observing the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.

What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
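A minimal sketch of this behavior, written as a script (or pasted into spark-shell). It assumes a local Spark session and uses a small in-memory collection as a stand-in for the file; the names lines, tracked and evaluated are made up for illustration. An accumulator counts how many times the map function actually runs:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `sc` already exists.
val spark = SparkSession.builder().master("local[2]").appName("lazy-demo").getOrCreate()
val sc = spark.sparkContext

// Counts how many elements the map function below has processed.
val evaluated = sc.longAccumulator("evaluated")

// Stand-in for sc.textFile(...): three "lines".
val lines = sc.parallelize(Seq("a", "b", "c"))

// A transformation: this only records what should happen later.
val tracked = lines.map { line => evaluated.add(1); line }
val afterDefinition = evaluated.sum   // 0: nothing has run yet

tracked.count()                       // an action: the map runs over all 3 elements
val afterFirstCount = evaluated.sum   // 3

tracked.count()                       // same action again: everything is recomputed
val afterSecondCount = evaluated.sum  // 6, assuming no task retries
```

The counter doubling on the second count is the "nothing is stored" part: without caching, each action replays the whole chain of operations.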

So what does RDD.cache do? If you add textFile.cache to the above code:

val textFile = sc.textFile("/user/emp.txt")
textFile.cache

It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
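The effect of cache can be observed with a small sketch, again assuming a local session and an in-memory collection in place of the file (evaluated, lines and cached are illustrative names, not anything from the question). The accumulator counts how often the map function runs:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `sc` already exists.
val spark = SparkSession.builder().master("local[2]").appName("cache-demo").getOrCreate()
val sc = spark.sparkContext

val evaluated = sc.longAccumulator("evaluated")
val lines = sc.parallelize(Seq("a", "b", "c"))

// cache() only marks the RDD for caching; nothing is computed here.
val cached = lines.map { line => evaluated.add(1); line }.cache()
val afterCache = evaluated.sum        // 0

cached.count()                        // first action: computes and stores the partitions
val afterFirstCount = evaluated.sum   // 3

cached.count()                        // second action: reads the cached partitions
val afterSecondCount = evaluated.sum  // still 3, as long as everything fit in memory
```

With data this small the partitions should always fit, so the map function is not run a second time.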

The cache behavior depends on the available memory. With the default storage level (MEMORY_ONLY), partitions that do not fit in memory are simply not cached, so for those partitions textFile.count falls back to the usual behavior and re-reads that part of the file.
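If memory is the concern, persist (the other method from the question) accepts an explicit storage level. A short sketch, assuming a local session and toy data; memoryOnly and spillable are illustrative names. For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), while MEMORY_AND_DISK writes partitions that do not fit in memory to disk instead of dropping them:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session for illustration only; in spark-shell, `sc` already exists.
val spark = SparkSession.builder().master("local[2]").appName("persist-demo").getOrCreate()
val sc = spark.sparkContext

val lines = sc.parallelize(Seq("a", "b", "c"))

// cache() uses the default level, MEMORY_ONLY.
val memoryOnly = lines.map(_.toUpperCase).cache()
val memoryOnlyLevel = memoryOnly.getStorageLevel

// persist() with an explicit level that may spill to disk.
val spillable = lines.map(_.toUpperCase).persist(StorageLevel.MEMORY_AND_DISK)
val spillableLevel = spillable.getStorageLevel

spillable.count()      // the first action materializes the persisted partitions
spillable.unpersist()  // drop the stored partitions once they are no longer needed
val afterUnpersist = spillable.getStorageLevel
```

Like cache, persist is lazy: the storage level is only recorded until an action runs.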

Daniel Darabos answered Oct 21 '22 10:10
