 

Spark cache vs broadcast

It looks like the broadcast method makes a distributed copy of an RDD in my cluster. On the other hand, executing the cache() method simply loads data into memory.

But I do not understand how a cached RDD is distributed across the cluster.

Could you please tell me in which cases I should use the rdd.cache() and rdd.broadcast() methods?

asked Jun 27 '16 by dmreshet


3 Answers

cache() or persist() allows a dataset to be used across operations.

When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

Each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap.
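A rough sketch of the difference in Scala (the input path and parsing steps are illustrative, not from the original):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheVsPersist").setMaster("local[*]"))

    // Hypothetical input path; any line-oriented file works.
    val lines = sc.textFile("/data/events.csv")

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
    // each worker keeps the partitions it computes in memory.
    val parsed = lines.map(_.split(",")).cache()

    // persist() lets you choose a different storage level, e.g.
    // serialized in memory, spilling to disk when memory is tight.
    val trimmed = parsed.map(_.map(_.trim))
    trimmed.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Both actions below reuse the stored partitions instead of
    // re-reading and re-parsing the input file.
    println(trimmed.count())
    trimmed.take(5).foreach(row => println(row.mkString("|")))

    sc.stop()
  }
}
```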

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
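A minimal broadcast sketch; the country-code lookup table is a made-up stand-in for whatever read-only data you need on every worker:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastExample").setMaster("local[*]"))

    // A small read-only lookup table we want available on every worker.
    val countryNames = Map("US" -> "United States", "DE" -> "Germany", "IN" -> "India")

    // Ship it once per executor instead of once with every task.
    val bcNames = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("US", "IN", "US", "DE"))

    // Tasks read the local copy through .value; no shuffle involved.
    val resolved = codes.map(code => bcNames.value.getOrElse(code, "Unknown"))

    resolved.collect().foreach(println)
    sc.stop()
  }
}
```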

You can find more details in the Spark programming guide (the RDD persistence and broadcast variables sections).

Useful posts:

Advantage of Broadcast Variables

What is the difference between cache and persist?

answered Oct 09 '22 by Ravindra babu

Could you please tell me in which cases I should use the rdd.cache() and rdd.broadcast() methods?

RDDs are divided into partitions. These partitions themselves act as immutable subsets of the entire RDD. When Spark executes each stage of the graph, each partition gets sent to a worker, which operates on that subset of the data. In turn, each worker can cache the data if the RDD needs to be iterated over again.

Broadcast variables are used to send some immutable state once to each worker. You use them when you want a local copy of a variable.

These two operations are quite different from each other, and each one represents a solution to a different problem.
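A short illustration of that difference (the numbers and the threshold are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheOrBroadcast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheOrBroadcast").setMaster("local[*]"))

    // cache(): the data is a distributed RDD that several actions re-iterate.
    val numbers = sc.parallelize(1 to 1000000).cache()
    val total = numbers.sum()                       // first action computes and caches partitions
    val evens = numbers.filter(_ % 2 == 0).count()  // second action reuses them

    // broadcast: the data is small, immutable, and needed locally in every task.
    val threshold = sc.broadcast(500000)
    val big = numbers.filter(_ > threshold.value).count()

    println(s"total=$total evens=$evens big=$big")
    sc.stop()
  }
}
```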

answered Oct 09 '22 by Yuval Itzchakov


Could you please tell me in which cases I should use the rdd.cache() and rdd.broadcast() methods?

Let's take an example: suppose you have employee_salary data that contains the department and salary of every employee. Now say the task is to find each employee's salary as a fraction of their department's average salary. (If employee e1 is in department d1, we need to find e1.salary / average(all salaries in d1).)

Now one way to do this is to first read the data into an RDD, say rdd1, and then do two things, one after the other:

First, calculate the department-wise average salary using rdd1.* You will eventually have the department average salaries on the driver, basically a map of deptId to average salary.

Second, you will need to use this result to divide each employee's salary by their department's average salary. Remember that on each worker there can be employees from any department, so you will need access to the department-wise average salaries on each worker. How to do this? Well, you can just send the average salary map you computed on the driver to each worker in a broadcast, and it can then be used to calculate the salary fraction for every "row" in rdd1.

What about caching an RDD? Remember that from the initial rdd1 there are two branches of computation: one for calculating the department-wise averages and another for applying those averages to each employee in the RDD. Now, if you do not cache rdd1, then for the second task you may need to go back to disk to read and recompute it, because Spark may have evicted it from memory by the time you reach that point. But since we know we will be using the same RDD again, we can ask Spark to keep it in memory the first time it is computed. Then the next time we need to apply a transformation to it, we already have it in memory. (Both steps are put together in the sketch after the footnote below.)

* We could use department-based partitioning to avoid the broadcast, but for the purpose of illustration let's say we do not.
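A minimal sketch of both steps in Scala, assuming a toy (employeeId, deptId, salary) RDD in place of the real employee_salary data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SalaryFraction {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SalaryFraction").setMaster("local[*]"))

    // (employeeId, deptId, salary) -- toy stand-in for the employee_salary data.
    val rdd1 = sc.parallelize(Seq(
      ("e1", "d1", 100.0), ("e2", "d1", 300.0),
      ("e3", "d2", 200.0), ("e4", "d2", 200.0)
    )).cache() // cached: two branches of computation reuse this RDD

    // First branch: department-wise average salary, collected to the driver.
    val deptAvg = rdd1
      .map { case (_, dept, salary) => (dept, (salary, 1)) }
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }
      .collectAsMap()

    // Broadcast the small map so every worker has a local copy.
    val bcAvg = sc.broadcast(deptAvg)

    // Second branch: reuse the cached rdd1 and divide each salary by
    // its department's average, looked up in the broadcast map.
    val fractions = rdd1.map { case (emp, dept, salary) =>
      (emp, salary / bcAvg.value(dept))
    }

    fractions.collect().foreach(println)
    sc.stop()
  }
}
```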

answered Oct 09 '22 by Sachin Tyagi