Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the purpose of cache an RDD in Apache Spark?

I am new for Apache Spark and I have couple of basic questions in spark which I could not understand while reading the spark material. Every materials have their own style of explanation. I am using PySpark Jupyter notebook on Ubuntu to practice.

As per my understanding, When I run the below command, the data in the testfile.csv is partitioned and stored in memory of the respective nodes.( actually I know its a lazy evaluation and it will not process until it sees action command ), but still the concept is

rdd1 = sc.textFile("testfile.csv")

My question is when I run the below transformation and action command, where does the rdd2 data will store.

1.Does it stores in memory?

rdd2 = rdd1.map( lambda x: x.split(",") )

rdd2.count()

I know the data in rdd2 will available till I close the jupyter notebook.Then what is the need of cache(), anyhow rdd2 is available to do all transformation. I heard after all the transformation the data in memory is cleared, what is that about?

  1. Is there any difference between keeping RDD in memory and cache()

    rdd2.cache()

like image 934
Wanderer Avatar asked Jul 26 '16 05:07

Wanderer


1 Answers

Does it stores in memory?

When you run a spark transformation via an action (count, print, foreach), then, and only then is your graph being materialized and in your case the file is being consumed. RDD.cache purpose it to make sure that the result of sc.textFile("testfile.csv") is available in memory and isn't needed to be read over again.

Don't confuse the variable with the actual operations that are being done behind the scenes. Caching allows you to re-iterate the data, making sure it is in memory (if there is sufficient memory to store it in it's entirety) if you want to re-iterate the said RDD, and as long as you've set the right storage level (which defaults to StorageLevel.MEMORY). From the documentation (Thanks @RockieYang):

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.


Is there any difference between keeping RDD in memory and cache()

As stated above, you keep it in memory via cache, as long as you've provided the right storage level. Otherwise, it won't necessarily be kept in memory at the time you want to re-use it.

like image 65
Yuval Itzchakov Avatar answered Oct 13 '22 16:10

Yuval Itzchakov