 

Apache Spark in-memory caching

Spark caches the working dataset into memory and then performs computations at memory speeds. Is there a way to control how long the working set resides in RAM?

I have a huge amount of data that is accessed across jobs. It takes time to load the data into RAM for the first job, and when the next job arrives it has to load all the data into RAM again, which is time consuming. Is there a way to cache the data permanently (or for a specified time) in RAM using Spark?

Atom asked Nov 11 '14


People also ask

Does Spark automatically cache data in memory as and when needed?

Spark automatically monitors every persist() and cache() call you make, checks usage on each node, and drops persisted data that is not used, following a least-recently-used (LRU) policy.
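As a rough illustration of that bookkeeping, here is a minimal local-mode sketch (app name and data are chosen here for illustration) that caches an RDD and then lists the RDD ids the context currently tracks as persistent via SparkContext.getPersistentRDDs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheBookkeeping {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-bookkeeping").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 1000000)
    rdd.cache()   // only marks the RDD; nothing is stored yet
    rdd.count()   // the first action actually materializes the cached blocks

    // Spark registers every persisted RDD per context; under memory pressure
    // it evicts blocks on its own, least-recently-used first.
    println(sc.getPersistentRDDs.keys)   // ids of RDDs currently marked persistent

    sc.stop()
  }
}
```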

When should we use cache in Spark?

Caching is recommended in the following situations: for RDD re-use in iterative machine learning applications; for RDD re-use in standalone Spark applications; and when RDD computation is expensive, where caching can help reduce the cost of recovery in case an executor fails.

Can we cache a DataFrame in Spark?

cache() is an Apache Spark transformation that can be used on a DataFrame, Dataset, or RDD when you want to perform more than one action. cache() caches the specified DataFrame, Dataset, or RDD in the memory of your cluster's workers.
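For instance, a minimal sketch (the events.parquet path and the filter expression are placeholders): caching pays off when the same DataFrame feeds more than one action.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameCacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("df-cache").master("local[*]").getOrCreate()

    // "events.parquet" and the filter are placeholders for illustration.
    val df = spark.read.parquet("events.parquet").filter("status = 'ok'")

    df.cache()                            // lazy: stored on the first action
    val total  = df.count()               // action 1 materializes the cache
    val sample = df.limit(10).collect()   // action 2 is served from cached blocks

    df.unpersist()
    spark.stop()
  }
}
```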

What is memory_and_disk in Apache Spark?

The default strategy in Apache Spark is MEMORY_AND_DISK. It is fine for the majority of pipelines: it uses all the available memory in the cluster and thus speeds up operations. If there is not enough memory for caching, Spark under this strategy saves the data on disk; reading blocks back from disk is usually faster than re-evaluating them.
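A short sketch of picking that level explicitly (local mode, synthetic data for illustration); note that for DataFrames, cache() is equivalent to persist(StorageLevel.MEMORY_AND_DISK):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistLevel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("persist-level").master("local[*]").getOrCreate()
    val df = spark.range(1000000L).toDF("id")   // synthetic data for illustration

    // Keep blocks in memory and spill evicted ones to disk rather than
    // dropping them; for DataFrames this is also what cache() does.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()   // materializes the cached blocks

    spark.stop()
  }
}
```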

What is cache in Apache Spark?

Using cache appropriately within Apache Spark allows you to be a master over your available resources. Memory is not free (although it can be cheap), and in many cases storing a DataFrame in memory is actually more expensive in the long run than going back to the source-of-truth dataset.

What are the levels of data persistence in Apache Spark?

There are several levels of data persistence in Apache Spark:

  1. MEMORY_ONLY. Data is cached in memory, in deserialized form only.
  2. MEMORY_AND_DISK. Data is cached in memory; if memory is insufficient, evicted blocks are serialized to disk.

How does Spark store data in memory?

Keeping data in memory improves performance by orders of magnitude. The main abstraction in Spark is the RDD, and RDDs are cached using the cache() or persist() method. When we use the cache() method, the RDD is stored entirely in memory.
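A spark-shell style sketch of that (data.txt is a placeholder input path): cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), and a storage level can only be assigned once per RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("rdd-cache").setMaster("local[*]"))

val rdd = sc.textFile("data.txt")   // placeholder input path

rdd.cache()   // shorthand for persist(StorageLevel.MEMORY_ONLY): deserialized, memory only
// rdd.persist(StorageLevel.MEMORY_AND_DISK)   // the alternative; a storage
// level can only be assigned once per RDD, so pick one or the other
rdd.count()   // the first action fills the cache
```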


1 Answer

To uncache explicitly, you can use RDD.unpersist().
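For example (a spark-shell style fragment, where rdd stands for whatever dataset was cached earlier):

```scala
// Assuming rdd is the dataset cached earlier: drop its blocks explicitly
// so the memory can be reused by other datasets.
rdd.unpersist()
```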

If you want to share cached RDDs across multiple jobs you can try the following:

  1. Cache the RDD in a single long-lived context and re-use that context for subsequent jobs. This way you cache once and read many times (see the sketch after this list).
  2. There are 'Spark job servers' built to provide exactly this functionality. Check out the Spark Job Server open-sourced by Ooyala.
  3. Use an external caching solution like Tachyon (now Alluxio).
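A minimal sketch of option 1, with placeholder paths and app names: a single long-lived SparkContext runs several actions (each a separate Spark job) against one cached RDD, so the data is loaded only once. A job server generalizes this by keeping such a context alive between externally submitted jobs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedContextCache {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shared-context-cache").setMaster("local[*]"))

    // Load and cache once; the cached blocks live as long as this context does.
    val data = sc.textFile("big-input.txt").cache()   // placeholder path

    // Each action below launches a separate Spark job, but both run inside
    // the same context and therefore reuse the cached blocks.
    val total  = data.count()                               // job 1: fills the cache
    val errors = data.filter(_.contains("ERROR")).count()   // job 2: reads from cache

    println(s"$errors error lines out of $total")
    sc.stop()
  }
}
```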

I have been experimenting with caching options in Spark. You can read more here: http://sujee.net/understanding-spark-caching/

Sujee Maniyam answered Oct 22 '22