 

Spark cache vs broadcast

It looks like the broadcast method makes a distributed copy of an RDD in my cluster. On the other hand, executing the cache() method simply loads data into memory.

But I do not understand how a cached RDD is distributed across the cluster.

Could you please tell me in which cases I should use the rdd.cache() and rdd.broadcast() methods?

asked Jun 27 '16 by dmreshet


3 Answers

cache() or persist() allows a dataset to be used across operations.

When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

Each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap.
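A rough sketch of the difference in Scala (the input path and parsing steps are illustrative, not from the original):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheVsPersist").setMaster("local[*]"))

    // Hypothetical input path; any line-oriented file works.
    val lines = sc.textFile("/data/events.csv")

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
    // each worker keeps the partitions it computes in memory.
    val parsed = lines.map(_.split(",")).cache()

    // persist() lets you choose a different storage level, e.g.
    // serialized in memory, spilling to disk when memory is tight.
    val trimmed = parsed.map(_.map(_.trim))
    trimmed.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Both actions below reuse the stored partitions instead of
    // re-reading and re-parsing the input file.
    println(trimmed.count())
    trimmed.take(5).foreach(row => println(row.mkString("|")))

    sc.stop()
  }
}
```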

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
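A minimal broadcast sketch; the country-code lookup table is a made-up stand-in for whatever read-only data you need on every worker:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastExample").setMaster("local[*]"))

    // A small read-only lookup table we want available on every worker.
    val countryNames = Map("US" -> "United States", "DE" -> "Germany", "IN" -> "India")

    // Ship it once per executor instead of once with every task.
    val bcNames = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("US", "IN", "US", "DE"))

    // Tasks read the local copy through .value; no shuffle involved.
    val resolved = codes.map(code => bcNames.value.getOrElse(code, "Unknown"))

    resolved.collect().foreach(println)
    sc.stop()
  }
}
```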

You can find more details in the Spark programming guide (the RDD persistence and broadcast variables sections).

Useful posts:

Advantage of Broadcast Variables

What is the difference between cache and persist?

answered Oct 09 '22 by Ravindra babu

Could you please tell me in which cases I should use the rdd.cache() and rdd.broadcast() methods?

RDDs are divided into partitions. These partitions themselves act as immutable subsets of the entire RDD. When Spark executes each stage of the graph, each partition gets sent to a worker, which operates on that subset of the data. In turn, each worker can cache the data if the RDD needs to be iterated over again.

Broadcast variables are used to send some immutable state once to each worker. You use them when you want a local copy of a variable.

These two operations are quite different from each other, and each one represents a solution to a different problem.
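A short illustration of that difference (the numbers and the threshold are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheOrBroadcast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheOrBroadcast").setMaster("local[*]"))

    // cache(): the data is a distributed RDD that several actions re-iterate.
    val numbers = sc.parallelize(1 to 1000000).cache()
    val total = numbers.sum()                       // first action computes and caches partitions
    val evens = numbers.filter(_ % 2 == 0).count()  // second action reuses them

    // broadcast: the data is small, immutable, and needed locally in every task.
    val threshold = sc.broadcast(500000)
    val big = numbers.filter(_ > threshold.value).count()

    println(s"total=$total evens=$evens big=$big")
    sc.stop()
  }
}
```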

answered Oct 09 '22 by Yuval Itzchakov


Could you please tell me in which cases I should use the rdd.cache() and rdd.broadcast() methods?

Let's take an example: suppose you have employee_salary data that contains the department and salary of every employee. Now say the task is to find each employee's salary as a fraction of their department's average salary. (If employee e1 is in department d1, we need to find e1.salary / average(all salaries in d1).)

Now one way to do this is to first read the data into an RDD, say rdd1, and then do two things, one after the other:

First, calculate the department-wise average salary using rdd1.* You will eventually have the department average salaries on the driver, basically a map of deptId to average salary.

Second, you will need to use this result to divide each employee's salary by their department's average salary. Remember that on each worker there can be employees from any department, so you will need access to the department-wise average salaries on each worker. How to do this? Well, you can just send the average salary map you computed on the driver to each worker in a broadcast, and it can then be used to calculate the salary fraction for every "row" in rdd1.

What about caching an RDD? Remember that from the initial rdd1 there are two branches of computation: one for calculating the department-wise averages and another for applying those averages to each employee in the RDD. Now, if you do not cache rdd1, then for the second task you may need to go back to disk to read and recompute it, because Spark may have evicted it from memory by the time you reach that point. But since we know we will be using the same RDD again, we can ask Spark to keep it in memory the first time it is computed. Then the next time we need to apply a transformation to it, we already have it in memory. (Both steps are put together in the sketch after the footnote below.)

* We could use department-based partitioning to avoid the broadcast, but for the purpose of illustration let's say we do not.
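A minimal sketch of both steps in Scala, assuming a toy (employeeId, deptId, salary) RDD in place of the real employee_salary data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SalaryFraction {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SalaryFraction").setMaster("local[*]"))

    // (employeeId, deptId, salary) -- toy stand-in for the employee_salary data.
    val rdd1 = sc.parallelize(Seq(
      ("e1", "d1", 100.0), ("e2", "d1", 300.0),
      ("e3", "d2", 200.0), ("e4", "d2", 200.0)
    )).cache() // cached: two branches of computation reuse this RDD

    // First branch: department-wise average salary, collected to the driver.
    val deptAvg = rdd1
      .map { case (_, dept, salary) => (dept, (salary, 1)) }
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }
      .collectAsMap()

    // Broadcast the small map so every worker has a local copy.
    val bcAvg = sc.broadcast(deptAvg)

    // Second branch: reuse the cached rdd1 and divide each salary by
    // its department's average, looked up in the broadcast map.
    val fractions = rdd1.map { case (emp, dept, salary) =>
      (emp, salary / bcAvg.value(dept))
    }

    fractions.collect().foreach(println)
    sc.stop()
  }
}
```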

answered Oct 09 '22 by Sachin Tyagi